[ 
https://issues.apache.org/jira/browse/BEAM-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514972#comment-17514972
 ] 

Ning commented on BEAM-14211:
-----------------------------

In that case, should we consolidate the "_to_list()" and "_plot()" logic into 
a standalone module, then call the same logic from both 
"ib.collect()/ib.show()" and "df.eager.to_list()/df.eager.plot()"? 

They should have the same behavior, right?
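
A rough sketch of that consolidation, in plain Python. Every name here (`to_list`, `EagerAccessor`, `ib_style_to_list`, `fake_collect`) is hypothetical and stands in for the real Beam pieces; `fake_collect` only imitates what a collect call would do on in-memory data.

```python
from typing import Any, Callable, Iterable, List


def to_list(source: Any, collect: Callable[[Any], Iterable[Any]]) -> List[Any]:
    """Shared helper: materialize `source` via `collect` and return a list."""
    return list(collect(source))


def fake_collect(source: Iterable[Any]) -> Iterable[Any]:
    # Assumption: stand-in for ib.collect(); here it just iterates the input.
    return iter(source)


def ib_style_to_list(source: Any) -> List[Any]:
    """Entry point in the style of the Interactive Beam module."""
    return to_list(source, fake_collect)


class EagerAccessor:
    """Entry point in the style of a `df.eager` namespace."""

    def __init__(self, source: Any) -> None:
        self._source = source

    def to_list(self) -> List[Any]:
        # Delegates to the same shared helper, so behavior cannot drift.
        return to_list(self._source, fake_collect)


data = [3, 1, 2]
via_ib = ib_style_to_list(data)
via_eager = EagerAccessor(data).to_list()
```

Because both entry points call the one helper, any change to the materialization logic applies to both at once.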

Also, is your plot logic applicable to non-DataFrame PCollections? Can we 
decouple it from the DataFrame API? I assume there are 2 parts:
 1. An optimization that implicitly appends aggregation operations based on the 
    plot being made;
 2. Fetching the materialized data points for plotting.

Part 1 might be hard for other, non-DataFrame PCollections; InteractiveRunner 
should have its own implementation for it.

Part 2 should be generic and applicable to any data without extra development.

Is the assumption correct?
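
To make the two parts concrete, here is a small plain-Python sketch under the assumption that the plot is a histogram. The function names (`aggregate_for_histogram`, `fetch_points`) and the `bin_width` parameter are hypothetical, and a list stands in for a PCollection.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def aggregate_for_histogram(
    values: Iterable[int], bin_width: int
) -> List[Tuple[int, int]]:
    """Part 1 (plot-specific): reduce raw values to (bin_start, count) pairs."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return sorted(counts.items())


def fetch_points(materialized: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Part 2 (generic): turn any materialized iterable into plottable points."""
    return list(materialized)


raw = [1, 2, 2, 7, 8, 15]
points = fetch_points(aggregate_for_histogram(raw, bin_width=5))
# points == [(0, 3), (5, 2), (15, 1)]
```

Only the first function knows what kind of plot is being made; the second works on whatever it is handed, which is the sense in which Part 2 would need no extra development.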

> Add "interactive" DataFrame operations that eagerly trigger execution
> ---------------------------------------------------------------------
>
>                 Key: BEAM-14211
>                 URL: https://issues.apache.org/jira/browse/BEAM-14211
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe
>            Reporter: Brian Hulette
>            Priority: P2
>
> The DataFrame API is completely deferred by design, which means users can 
> quickly build up a pipeline of operations and explicitly execute it when they 
> want to. However, the pandas library is designed for eager execution on 
> in-memory datasets, so many operations that users are accustomed to using in 
> pandas are difficult or impossible to implement in a deferred context.
> We should consider adding a set of "interactive" tools that are eagerly 
> executed through tight integration with Interactive Beam (i.e. they call 
> ib.collect() internally). All non-deferred-result, non-deferred-columns, and 
> plotting operations (see [coverage 
> status|https://docs.google.com/spreadsheets/d/1hHAaJ0n0k2tw465ORs5tfdy4Lg0DnGWIQ53cLjAhel0/edit])
>  could be included in this set.
> We need to make sure that these tools are easily distinguishable from 
> standard, deferred operations. It's important that users are not surprised 
> when these operations trigger execution. I won't prescribe a detailed design 
> here yet, but some approaches to consider:
> - All such operations are defined in a particular namespace ("interactive", 
> "eager", "collect"?), i.e. users would access them as 
> {{df.interactive.plot()}}, {{df.interactive.to_list()}}, 
> {{df.interactive.pivot()}}.
> - When used in a notebook context, users could see some interaction (an "are 
> you sure?" dialog, a page to enter parameters like project id, ...) that 
> explains why execution was triggered and gives them an opportunity to abort.
> Ideally, this feature would not be tightly coupled to notebooks. Users might 
> want to use these tools in an IPython interpreter, or in a Python script 
> (even plots could make sense in this context: the plot operation should 
> return an object that the user can use to write the plot to a png).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
