[
https://issues.apache.org/jira/browse/BEAM-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514959#comment-17514959
]
Ning commented on BEAM-14211:
-----------------------------
Thanks, Brian. Yes, we don't need to be tightly coupled. `ib.collect()` or
`ib.show()` could simply call your DataFrame logic for any DeferredDataFrame
object passed in, such as:
* ib.collect(df: DeferredDataFrame) --> return
df.[interactive|eager|collect].to_list()
* ib.show(df: DeferredDataFrame, **kwargs) -->
df.[interactive|eager|collect].plot(**kwargs)
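A minimal sketch of that delegation, written in plain Python so it runs without Beam installed. `FakeDeferredFrame` and `InteractiveAccessor` are hypothetical stand-ins for a DeferredDataFrame and the proposed `interactive` namespace, and `collect` / `show` stand in for `ib.collect` / `ib.show`. None of these are existing Beam APIs; they only illustrate the shape of the call chain.

```python
class InteractiveAccessor:
    """Hypothetical eager namespace: every method here triggers execution."""

    def __init__(self, frame):
        self._frame = frame

    def to_list(self):
        # A real implementation would run the pipeline; this stand-in just
        # evaluates the callable stored on the frame.
        return list(self._frame._compute())

    def plot(self, **kwargs):
        # Return a plain description instead of drawing, so the caller
        # decides how (and whether) to render it.
        return {"data": self.to_list(), "options": kwargs}


class FakeDeferredFrame:
    """Hypothetical stand-in for a deferred frame; holds work, runs nothing."""

    def __init__(self, compute):
        self._compute = compute

    @property
    def interactive(self):
        return InteractiveAccessor(self)


def collect(df):
    # Stand-in for ib.collect: delegate to the frame's eager namespace.
    return df.interactive.to_list()


def show(df, **kwargs):
    # Stand-in for ib.show: forward keyword arguments unchanged.
    return df.interactive.plot(**kwargs)
```

With this shape, the Interactive Beam entry points stay thin wrappers, and the eager logic lives entirely on the DataFrame side.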
Your plot library doesn't need to be limited to matplotlib; you can use any
Python or JavaScript library. I've shared a
[notebook|https://drive.google.com/file/d/1IlkuKpYa930k0-5PikqoCREaEC44g5nn/view?usp=sharing]
with you that shows some potential options.
To display a confirmation dialog, input forms, or any other web elements, you
can use ipywidgets (this is not specific to any notebook runtime):
[https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html]
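As a rough sketch of such a confirmation step, assuming ipywidgets is available in the notebook kernel: `run_if_confirmed` and `confirmation_box` are hypothetical helper names, and `action` is any zero-argument callable that triggers execution. The ipywidgets import is kept inside the function so the decision logic can be used and tested without a notebook.

```python
def run_if_confirmed(action, confirmed):
    """Run `action` only when the user confirmed; return None on abort."""
    return action() if confirmed else None


def confirmation_box(action, on_result):
    """Build a Run/Cancel button pair (requires ipywidgets, imported lazily).

    `on_result` receives the action's result, or None if the user cancelled.
    """
    import ipywidgets as widgets

    run = widgets.Button(description="Run pipeline", button_style="warning")
    cancel = widgets.Button(description="Cancel")
    # on_click passes the clicked button to the callback; it is unused here.
    run.on_click(lambda _button: on_result(run_if_confirmed(action, True)))
    cancel.on_click(lambda _button: on_result(run_if_confirmed(action, False)))
    return widgets.HBox([run, cancel])
```

In a notebook cell, the returned box could be shown with `IPython.display.display(confirmation_box(lambda: ib.collect(df), print))`, so nothing executes until the user clicks "Run pipeline".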
We can even build it into our JupyterLab extension once we have your API
checked in:
[https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development#visualizing_the_data_through_the_interactive_beam_inspector]
Another integration we could add is to make use of the job's runtime, for
example:
* InteractiveRunner(underlying_runner=FlinkRunner(),
options=google_cloud_options)
> Add "interactive" DataFrame operations that eagerly trigger execution
> ---------------------------------------------------------------------
>
> Key: BEAM-14211
> URL: https://issues.apache.org/jira/browse/BEAM-14211
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe
> Reporter: Brian Hulette
> Priority: P2
>
> The DataFrame API is completely deferred by design, which means users can
> quickly build up a pipeline of operations and explicitly execute it when they
> want to. However, the pandas library is designed for eager execution on
> in-memory datasets, so many operations that users are accustomed to using in
> pandas are difficult or impossible to implement in a deferred context.
> We should consider adding a set of "interactive" tools that are eagerly
> executed through tight integration with Interactive Beam (i.e. they call
> ib.collect() internally). All non-deferred-result, non-deferred-columns, and
> plotting operations (see [coverage
> status|https://docs.google.com/spreadsheets/d/1hHAaJ0n0k2tw465ORs5tfdy4Lg0DnGWIQ53cLjAhel0/edit])
> could be included in this set.
> We need to make sure that these tools are easily distinguishable from
> standard, deferred operations. It's important that users are not surprised
> when these operations trigger execution. I won't prescribe a detailed design
> here yet, but some approaches to consider:
> - All such operations are defined in a particular namespace ("interactive",
> "eager", "collect"?), i.e. users would access them as
> {{df.interactive.plot()}}, {{df.interactive.to_list()}},
> {{df.interactive.pivot()}}.
> - When used in a notebook context, users could see some interaction (an "are
> you sure?" dialog, a page to enter parameters like project id, ...) that
> explains why execution was triggered and gives them an opportunity to abort.
> Ideally this feature would not be tightly coupled to notebooks. Users might
> want to use these tools in an IPython interpreter, or in a Python script
> (even plots could make sense in this context: the plot operation should
> return an object that the user can use to write the plot to a PNG).
--
This message was sent by Atlassian Jira
(v8.20.1#820001)