[ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=292216&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-292216
 ]

ASF GitHub Bot logged work on BEAM-7760:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Aug/19 18:41
            Start Date: 09/Aug/19 18:41
    Worklog Time Spent: 10m 
      Work Description: KevinGG commented on issue #9278: [BEAM-7760] Added 
iBeam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-520023376
 
 
   Cool! I won't touch the README yet, since I'm constructing the building 
blocks of a new iBeam without integrating them into (and thus changing the 
behavior of) the existing iBeam. Once I make those integrations and changes, 
I'll update the README as the changes go in.
   
   For a broader view, I wrote a publicly visible 
[design](https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing)
 overview describing the changes we are making to components around the 
interactive runner. I'll share the document in our email thread too.
   
   Since the interactive runner is not yet a recognized runner in the Beam SDK 
(it is fundamentally a wrapper around the direct runner), we are not touching 
any Beam SDK components. We won't change any behavior of the existing Beam SDK, 
and we'll try our best to keep it that way in the future.
   
   In the meantime, I'll work on other components orthogonal to Beam, such as 
the Pipeline Display and Data Visualization I mentioned in the design overview.
   
   Thanks!
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 292216)
    Time Spent: 2h 40m  (was: 2.5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> ----------------------------------------------------------------------------
>
>                 Key: BEAM-7760
>                 URL: https://issues.apache.org/jira/browse/BEAM-7760
>             Project: Beam
>          Issue Type: New Feature
>          Components: examples-python
>            Reporter: Ning Kang
>            Assignee: Ning Kang
>            Priority: Major
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user-defined variables in a pipeline when 
> running the pipeline with the interactive runner in Jupyter notebooks.
> [Interactive 
> Beam|https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]
>  has been caching "leaf" PCollections and reusing those caches for 
> interactive execution in Jupyter notebooks.
> Interactive execution is currently supported so that, when new transforms 
> are appended to an existing pipeline for a new run, the already executed 
> part of the pipeline doesn't need to be re-executed. 
> A PCollection is a "leaf" when it is never used as input to any PTransform 
> in the pipeline.
> The problem with building caches and the pipeline to execute around "leaf" 
> PCollections is that when a PCollection is consumed by a sink with no 
> output, the pipeline built for execution misses the subgraph that generates 
> and consumes that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" results in an empty 
> pipeline.
> Caching PCollections bound to user-defined variables, and replacing 
> transforms with the sources and sinks of those caches, could build the 
> pipeline to execute correctly in the interactive execution scenario. In 
> addition, a cached PCollection can now be traced back to user code and used 
> for data visualization if the user wants it.
> E.g.,
> {code:python}
> import apache_beam as beam
> from apache_beam.runners.interactive import interactive_runner
> # ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>                   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> # ...
> visualize(messages){code}
>  The interactive runner automatically figures out that the PCollection 
> {{messages}} created by
> {code:python}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  Once the pipeline has been executed, the user can use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (streaming).
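
The difference between the two caching strategies described in the issue can be
illustrated with a toy model. This is only a sketch under stated assumptions,
not the interactive runner's implementation: FakePCollection, FakeTransform,
leaf_pcollections, user_bound_pcollections and notebook_namespace are all
hypothetical names, and the classes are stand-ins so the example runs without
Beam installed.

```python
class FakePCollection:
    """Stand-in for a Beam PCollection (assumption: not the real class)."""


class FakeTransform:
    """Hypothetical transform node: a label plus input and output PCollections."""

    def __init__(self, label, inputs=(), outputs=()):
        self.label = label
        self.inputs = tuple(inputs)
        self.outputs = tuple(outputs)


def leaf_pcollections(transforms):
    """Return the PCollections that no transform consumes as input."""
    consumed = {id(pc) for t in transforms for pc in t.inputs}
    return [pc for t in transforms for pc in t.outputs if id(pc) not in consumed]


def user_bound_pcollections(namespace):
    """Return {variable name: PCollection} for PCollections bound in a namespace."""
    return {
        name: value
        for name, value in namespace.items()
        if isinstance(value, FakePCollection) and not name.startswith("_")
    }


# "Read --> Write": the sink consumes `messages` and produces no output.
messages = FakePCollection()
pipeline = [
    FakeTransform("Read", outputs=[messages]),
    FakeTransform("Write", inputs=[messages]),
]
# Hypothetical snapshot of the notebook's user-defined variables.
notebook_namespace = {"messages": messages, "topic_path": "..."}

leaves = leaf_pcollections(pipeline)
bound = user_bound_pcollections(notebook_namespace)
print(len(leaves))    # 0: no leaf exists, so a leaf-based pipeline is empty
print(sorted(bound))  # ['messages']: still reachable through its variable name
```

In this model the leaf-based strategy finds nothing to cache or build around,
while the variable-based strategy still identifies `messages` as a caching
target.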



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
