[ 
https://issues.apache.org/jira/browse/BEAM-7760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Kang updated BEAM-7760:
----------------------------
    Description: 
Cache only PCollections bound to user defined variables in a pipeline when 
running pipeline with interactive runner in jupyter notebooks.

[Interactive 
Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
 has been caching and using caches of "leaf" PCollections for interactive 
execution in jupyter notebooks.

The interactive execution is currently supported so that when appending new 
transforms to existing pipeline for a new run, executed part of the pipeline 
doesn't need to be re-executed. 

A PCollection is "leaf" when it is never used as input in any PTransform in the 
pipeline.

The problem with building caches and pipeline to execute around "leaf" is that 
when a PCollection is consumed by a sink with no output, the pipeline to 
execute built will miss the subgraph generating and consuming that PCollection.

An example, "ReadFromPubSub --> WirteToPubSub" will result in an empty pipeline.

Caching around PCollections bound to user defined variables and replacing 
transforms with source and sink of caches could resolve the pipeline to execute 
properly under the interactive execution scenario. Also, cached PCollection now 
can trace back to user code and can be used for user data visualization if user 
wants to do it.

E.g.,
{code:java}
// ...
p = beam.Pipeline(interactive_runner.InteractiveRunner(),
                  options=pipeline_options)
messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
messages | "Write" >> beam.io.WriteToPubSub(topic_path)
result = p.run()
// ...
visualize(messages){code}
 The interactive runner automatically figures out that PCollection
{code:java}
messages{code}
created by
{code:java}
p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
should be cached and reused if the notebook user appends more transforms.
 And once the pipeline gets executed, the user could use any 
visualize(PCollection) module to visualize the data statically (batch) or 
dynamically (stream)

  was:
Cache only PCollections bound to user defined variables in a pipeline when 
running pipeline with interactive runner in jupyter notebooks.

[Interactive 
Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
 has been caching and using caches of "leaf" PCollections for interactive 
execution in jupyter notebooks.

The interactive execution is currently supported so that when appending new 
transforms to existing pipeline for a new run, executed part of the pipeline 
doesn't need to be re-executed. 

A PCollection is "leaf" when it is never used as input in any PTransform in the 
pipeline.

The problem with building caches and pipeline to execute around "leaf" is that 
when a PCollection is consumed by a sink with no output, the pipeline to 
execute built will miss the subgraph generating and consuming that PCollection.

An example, "ReadFromPubSub --> WirteToPubSub" will result in an empty pipeline.

Caching around PCollections bound to user defined variables and replacing 
transforms with source and sink of caches could resolve the pipeline to execute 
properly under the interactive execution scenario. Also, cached PCollection now 
can trace back to user code and can be used for user data visualization if user 
wants to do it.

E.g.,
{code:java}
// ...
p = beam.Pipeline(interactive_runner.InteractiveRunner(),
                  options=pipeline_options)
messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
messages | "Write" >> beam.io.WriteToPubSub(topic_path)
result = p.run()
// ...
visualize(messages){code}
 The interactive runner automatically figures out that PCollection created by
{code:java}
p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
should be cached and reused if the notebook user appends more transforms.
And once the pipeline gets executed, the user could use any 
visualize(PCollection) module to visualize the data statically (batch) or 
dynamically (stream)


> Interactive Beam Caching PCollections bound to user defined vars in notebook
> ----------------------------------------------------------------------------
>
>                 Key: BEAM-7760
>                 URL: https://issues.apache.org/jira/browse/BEAM-7760
>             Project: Beam
>          Issue Type: New Feature
>          Components: examples-python
>            Reporter: Ning Kang
>            Priority: Major
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> The interactive execution is currently supported so that when appending new 
> transforms to existing pipeline for a new run, executed part of the pipeline 
> doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and pipeline to execute around "leaf" is 
> that when a PCollection is consumed by a sink with no output, the pipeline to 
> execute built will miss the subgraph generating and consuming that 
> PCollection.
> An example, "ReadFromPubSub --> WirteToPubSub" will result in an empty 
> pipeline.
> Caching around PCollections bound to user defined variables and replacing 
> transforms with source and sink of caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, cached 
> PCollection now can trace back to user code and can be used for user data 
> visualization if user wants to do it.
> E.g.,
> {code:java}
> // ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> // ...
> visualize(messages){code}
>  The interactive runner automatically figures out that PCollection
> {code:java}
> messages{code}
> created by
> {code:java}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  And once the pipeline gets executed, the user could use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (stream)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to