damccorm opened a new issue, #21638:
URL: https://github.com/apache/beam/issues/21638

   The DataFrame API is completely deferred by design: users can 
quickly build up a pipeline of operations and explicitly execute it when they 
want to. However, the pandas library is designed for eager execution on 
in-memory datasets, so many operations that users are accustomed to using in 
pandas are difficult or impossible to implement in a deferred context.
   
   We should consider adding a set of "interactive" tools that are eagerly 
executed through tight integration with Interactive Beam (i.e. they call 
ib.collect() internally). All non-deferred-result, non-deferred-columns, and 
plotting operations (see [coverage 
status](https://docs.google.com/spreadsheets/d/1hHAaJ0n0k2tw465ORs5tfdy4Lg0DnGWIQ53cLjAhel0/edit))
 could be included in this set.
   
   We need to make sure that these tools are easily distinguishable from 
standard, deferred operations. It's important that users are not surprised when 
these operations trigger execution. I won't prescribe a detailed design here 
yet, but some approaches to consider:
   - All such operations are defined in a particular namespace ("interactive", 
"eager", "collect"?), i.e. users would access them as `df.interactive.plot()`, 
`df.interactive.to_list()`, `df.interactive.pivot()`.
   - When used in a notebook context, users could see some interaction (an "are 
you sure?" dialog, a page to enter parameters like project id, ...) that 
explains why execution was triggered and gives them an opportunity to abort.
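
   To make the two approaches above concrete, here is a minimal sketch in plain Python. It is not Beam code: `DeferredFrame`, `InteractiveAccessor`, `with_confirmation`, and `ExecutionAborted` are hypothetical names invented for illustration, and the toy `_collect()` method stands in for the point where a real implementation would call `ib.collect()`. It shows only the shape of the idea: eager operations live under a separate `interactive` namespace, and a caller-supplied confirmation hook can abort before anything executes.

```python
from typing import Callable, List, Optional


class ExecutionAborted(Exception):
    """Raised when the user declines to trigger execution (hypothetical)."""


class DeferredFrame:
    """Toy stand-in for a deferred DataFrame: it records operations only."""

    def __init__(self, rows: List[dict],
                 ops: Optional[List[Callable[[dict], bool]]] = None):
        self._rows = rows
        self._ops = ops or []

    def filter(self, predicate: Callable[[dict], bool]) -> "DeferredFrame":
        # Deferred: extend the recorded plan and return a new frame.
        return DeferredFrame(self._rows, self._ops + [predicate])

    def _collect(self) -> List[dict]:
        # Assumption: a real implementation would call ib.collect() here.
        rows = self._rows
        for predicate in self._ops:
            rows = [row for row in rows if predicate(row)]
        return rows

    @property
    def interactive(self) -> "InteractiveAccessor":
        # Every eager operation is reached through this namespace.
        return InteractiveAccessor(self)


class InteractiveAccessor:
    """Hypothetical namespace for operations that trigger execution."""

    def __init__(self, frame: DeferredFrame,
                 confirm: Callable[[str], bool] = lambda reason: True):
        self._frame = frame
        self._confirm = confirm

    def with_confirmation(
            self, confirm: Callable[[str], bool]) -> "InteractiveAccessor":
        # The hook receives the reason for execution and may veto it,
        # e.g. by showing a dialog in a notebook or a prompt in a script.
        return InteractiveAccessor(self._frame, confirm)

    def to_list(self, column: str) -> list:
        reason = f"to_list({column!r}) needs the materialized data"
        if not self._confirm(reason):
            raise ExecutionAborted(reason)
        return [row[column] for row in self._frame._collect()]


df = DeferredFrame([{"word": "a", "count": 3}, {"word": "b", "count": 1}])
frequent = df.filter(lambda row: row["count"] > 1)  # nothing runs yet
words = frequent.interactive.to_list("word")        # execution happens here
print(words)
```

   Because the confirmation hook is an ordinary callable, the same accessor could be driven by a notebook widget, an IPython prompt, or a script that always answers yes, which keeps the feature from being tied to notebooks.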
   
   Ideally this feature would not be tightly coupled to notebooks. Users might 
want to use these tools in an IPython interpreter or in a Python script. Even 
plots could make sense in this context: the plot operation should return an 
object that the user can use to write the plot to a PNG file.
   
   Imported from Jira 
[BEAM-14211](https://issues.apache.org/jira/browse/BEAM-14211). Original Jira 
may contain additional context.
   Reported by: bhulette.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
