[ 
https://issues.apache.org/jira/browse/BEAM-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette updated BEAM-14211:
---------------------------------
    Description: 
The DataFrame API is completely deferred by design, which means users can 
quickly build up a pipeline of operations and explicitly execute it when they 
want to. However, the pandas library is designed for eager execution on 
in-memory datasets, so many operations that users are accustomed to using in 
pandas are difficult or impossible to implement in a deferred context.

We should consider adding a set of "interactive" tools that are eagerly 
executed through tight integration with Interactive Beam (i.e. they call 
ib.collect() internally). All non-deferred-result, non-deferred-columns, and 
plotting operations (see [coverage 
status|https://docs.google.com/spreadsheets/d/1hHAaJ0n0k2tw465ORs5tfdy4Lg0DnGWIQ53cLjAhel0/edit])
 could be included in this set.

We need to make sure that these tools are easily distinguishable from standard, 
deferred operations. It's important that users are not surprised when these 
operations trigger execution. I won't prescribe a detailed design here yet, but 
some approaches to consider:
- All such operations are defined in a particular namespace ("interactive", 
"eager", "collect"?), i.e. users would access them as 
{{df.interactive.plot()}}, {{df.interactive.to_list()}}, 
{{df.interactive.pivot()}}.
- When used in a notebook context users could see some interaction (an "are you 
sure?" dialog, a page to enter parameters like project id, ...) that explains 
why execution was triggered and gives them an opportunity to abort.
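
To make the two approaches above concrete, here is a minimal sketch of how an eager namespace with an abort hook might look. It is not a proposed design and does not use Beam itself: {{DeferredFrame}}, {{InteractiveNamespace}}, {{fake_collect}}, and the {{confirm}} callback are all hypothetical names invented for illustration. In a real implementation the {{collect}} callable would presumably be {{ib.collect()}}, which returns a pandas DataFrame rather than a list.

```python
from typing import Any, Callable, List, Optional


class DeferredFrame:
    """Hypothetical stand-in for a deferred DataFrame.

    Operations are only recorded; nothing is computed until something
    in the `interactive` namespace asks for a result.
    """

    def __init__(
        self,
        rows: List[Any],
        collect: Callable[["DeferredFrame"], List[Any]],
        confirm: Optional[Callable[[str], bool]] = None,
    ) -> None:
        self._rows = rows
        self._ops: List[Callable[[Any], bool]] = []
        self._collect = collect  # assumption: ib.collect in real Beam
        self._confirm = confirm  # assumption: a dialog in a notebook

    def filter(self, predicate: Callable[[Any], bool]) -> "DeferredFrame":
        # Deferred: record the operation and return a new frame.
        new = DeferredFrame(self._rows, self._collect, self._confirm)
        new._ops = self._ops + [predicate]
        return new

    @property
    def interactive(self) -> "InteractiveNamespace":
        # Everything reachable through this attribute triggers execution.
        return InteractiveNamespace(self)


class InteractiveNamespace:
    """Hypothetical namespace holding the eagerly executed operations."""

    def __init__(self, frame: DeferredFrame) -> None:
        self._frame = frame

    def _execute(self, op_name: str) -> List[Any]:
        frame = self._frame
        # Give the user a chance to abort before anything runs.
        if frame._confirm is not None and not frame._confirm(op_name):
            raise RuntimeError(f"{op_name} aborted before execution")
        return frame._collect(frame)

    def to_list(self) -> List[Any]:
        return list(self._execute("to_list"))


def fake_collect(frame: DeferredFrame) -> List[Any]:
    # Stand-in for pipeline execution: apply the recorded filters in order.
    rows = frame._rows
    for predicate in frame._ops:
        rows = [row for row in rows if predicate(row)]
    return rows


df = DeferredFrame([1, 2, 3, 4], collect=fake_collect)
evens = df.filter(lambda x: x % 2 == 0)  # still deferred, nothing has run
print(evens.interactive.to_list())       # execution happens here: [2, 4]
```

Because the confirmation step is just a callable, the same namespace could be backed by a dialog in a notebook, a prompt in an IPython interpreter, or nothing at all in a script.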

Ideally this feature would not be tightly coupled to notebooks. Users might 
want to use these tools in an IPython interpreter, or in a Python script. Even 
plots could make sense in this context: the plot operation should return an 
object that the user can use to write the plot to a PNG file.

  was:
The DataFrame API is completely deferred by design, it means users can quickly 
build up a pipeline of operations and explicitly execute it when they want to. 
However the pandas library is designed for eager execution on in-memory 
datasets, so many operations that users are accustomed to using in pandas are 
difficult or impossible to implement in a deferred context.

We should consider adding a set of "interactive" tools that are eagerly 
executed through tight integration with Interactive Beam (i.e. they call 
ib.collect() internally). All non-deferred-result, non-deferred-columns, and 
plotting operations (see [coverage 
status|https://docs.google.com/spreadsheets/d/1hHAaJ0n0k2tw465ORs5tfdy4Lg0DnGWIQ53cLjAhel0/edit])
 could be included in this set.

We need to make sure that these tools are easily distinguishable from standard, 
deferred operations. It's important that users are not surprised when these 
operations trigger execution. I won't prescribe a detailed design here yet, but 
some approaches to consider:
- All such operations are defined in a particular namespace ("interactive", 
"eager", "collect"?), i.e. users would access them as 
{{df.interactive.plot()}}, {{df.interactive.to_list()}}, 
{{df.interactive.pivot()}}.
- When used in a notebook context users should get some opportunity to abort 
(an "are you sure?" dialog, a page to enter parameters like project id, ...) 
that explains why execution was triggered.

Ideally this feature would not be tightly coupled to notebooks. Users might 
want to use these tools in an IPython interpreter, or in a python script (even 
plots could make sense in this context, the plot operation should return an 
object that the user can use to write the plot to a png).


> Add "interactive" DataFrame operations that eagerly trigger execution
> ---------------------------------------------------------------------
>
>                 Key: BEAM-14211
>                 URL: https://issues.apache.org/jira/browse/BEAM-14211
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe
>            Reporter: Brian Hulette
>            Priority: P2
>
> The DataFrame API is completely deferred by design, which means users can 
> quickly build up a pipeline of operations and explicitly execute it when they 
> want to. However, the pandas library is designed for eager execution on 
> in-memory datasets, so many operations that users are accustomed to using in 
> pandas are difficult or impossible to implement in a deferred context.
> We should consider adding a set of "interactive" tools that are eagerly 
> executed through tight integration with Interactive Beam (i.e. they call 
> ib.collect() internally). All non-deferred-result, non-deferred-columns, and 
> plotting operations (see [coverage 
> status|https://docs.google.com/spreadsheets/d/1hHAaJ0n0k2tw465ORs5tfdy4Lg0DnGWIQ53cLjAhel0/edit])
>  could be included in this set.
> We need to make sure that these tools are easily distinguishable from 
> standard, deferred operations. It's important that users are not surprised 
> when these operations trigger execution. I won't prescribe a detailed design 
> here yet, but some approaches to consider:
> - All such operations are defined in a particular namespace ("interactive", 
> "eager", "collect"?), i.e. users would access them as 
> {{df.interactive.plot()}}, {{df.interactive.to_list()}}, 
> {{df.interactive.pivot()}}.
> - When used in a notebook context users could see some interaction (an "are 
> you sure?" dialog, a page to enter parameters like project id, ...) that 
> explains why execution was triggered and gives them an opportunity to abort.
> Ideally this feature would not be tightly coupled to notebooks. Users might 
> want to use these tools in an IPython interpreter, or in a Python script. 
> Even plots could make sense in this context: the plot operation should 
> return an object that the user can use to write the plot to a PNG file.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
