[
https://issues.apache.org/jira/browse/BEAM-7760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anonymous updated BEAM-7760:
----------------------------
Status: Triage Needed (was: Resolved)
> Interactive Beam Caching PCollections bound to user defined vars in notebook
> ----------------------------------------------------------------------------
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
> Issue Type: New Feature
> Components: runner-py-interactive
> Reporter: Ning
> Assignee: Ning
> Priority: P2
> Fix For: 2.18.0
>
> Time Spent: 18h
> Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
> has been caching and using caches of "leaf" PCollections for interactive
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new
> transforms to existing pipeline for a new run, executed part of the pipeline
> doesn't need to be re-executed.
> A PCollection is "leaf" when it is never used as input in any PTransform in
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is
> that when a PCollection is consumed by a sink with no output, the pipeline to
> execute built will miss the subgraph generating and consuming that
> PCollection.
> An example, "ReadFromPubSub --> WirteToPubSub" will result in an empty
> pipeline.
> Caching around PCollections bound to user defined variables and replacing
> transforms with source and sink of caches could resolve the pipeline to
> execute properly under the interactive execution scenario. Also, cached
> PCollection now can trace back to user code and can be used for user data
> visualization if user wants to do it.
> E.g.,
> {code:java}
> // ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
> options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> // ...
> visualize(messages){code}
> The interactive runner automatically figures out that PCollection
> {code:java}
> messages{code}
> created by
> {code:java}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
> And once the pipeline gets executed, the user could use any
> visualize(PCollection) module to visualize the data statically (batch) or
> dynamically (stream)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)