[
https://issues.apache.org/jira/browse/BEAM-7760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ismaël Mejía updated BEAM-7760:
-------------------------------
Status: Open (was: Triage Needed)
> Interactive Beam Caching PCollections bound to user defined vars in notebook
> ----------------------------------------------------------------------------
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
> Issue Type: New Feature
> Components: examples-python
> Reporter: Ning Kang
> Priority: Major
>
> Cache only PCollections bound to user defined variables in a pipeline when
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
> has been caching and using caches of "leaf" PCollections for interactive
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new
> transforms to existing pipeline for a new run, executed part of the pipeline
> doesn't need to be re-executed.
> A PCollection is "leaf" when it is never used as input in any PTransform in
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is
> that when a PCollection is consumed by a sink with no output, the pipeline to
> execute built will miss the subgraph generating and consuming that
> PCollection.
> An example, "ReadFromPubSub --> WirteToPubSub" will result in an empty
> pipeline.
> Caching around PCollections bound to user defined variables and replacing
> transforms with source and sink of caches could resolve the pipeline to
> execute properly under the interactive execution scenario. Also, cached
> PCollection now can trace back to user code and can be used for user data
> visualization if user wants to do it.
> E.g.,
> {code:java}
> // ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
> options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> // ...
> visualize(messages){code}
> The interactive runner automatically figures out that PCollection
> {code:java}
> messages{code}
> created by
> {code:java}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
> And once the pipeline gets executed, the user could use any
> visualize(PCollection) module to visualize the data statically (batch) or
> dynamically (stream)
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)