Hi Ning! Thanks for the design doc and the explanations. I think I can grasp some of the concepts, but it is not yet 100% clear to me why it's necessary to define a new abstraction to have interactivity. Could you elaborate? Perhaps as a section in the doc? : )
A lot of the motivation for this doc seems related to how we decide which PCollections to cache - so as to avoid rerunning parts of a pipeline whenever a user decides to visualize specific parts. I think that makes sense (and probably helps to have interactivity on streaming). I agree that it's a little odd that InteractiveRunner receives an underlying runner. That certainly suggests that the functionality is orthogonal. So, in short: I think my feedback is similar to others: Can you justify further (or reconsider) why pipeline creation and execution need to be different? I can see what's the need for the watch. Can you also tell us more about how a user would use visualize? Do they pass the kind of plot to have? Thanks! -P. On Wed, Aug 14, 2019 at 12:03 PM Ning Kang <[email protected]> wrote: > Q1: > The document is shared ( > https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing). > If inside Google, short link (go/ibeam-external > <https://goto.google.com/ibeam-external>). I cannot share internal > documents, but you can reach out if you need internal engineering plan. > > Q2: > Yes, watch() is optimization used for using visualization() and building > further on the pipeline. And the user doesn't need to call it if they > simply define the pipeline in the notebook. > > Q3 and Q4: > I'm only focusing on direct runner as underlying runner. We'll get rid of > many of existing interactive Beam implementation. We can't provide > portability for interactivity. Users can run the pipeline with other > runners though due to the pipeline portability. > Our work is to reduce the new concepts a user needs to know when they want > to run interactive Beam. The implementation could be arbitrarily > complicated and open sourced though. Currently, the interactive runner > looks like as if it's supporting all kinds of underlying runners. We want > to rid of it too. > > On 2019/08/08 00:01:06, Ahmet Altay <[email protected]> wrote: > > Ning, thank you for the heads up. > > > > All, this is a proposed work for improving interactive Beam experience. > As > > mentioned in Ning's email, new concepts are being introduced. And in > > addition iBeam as a name is used as a new reference. I hope that bringing > > the discussion to the mailing list will give it the additional > > visibility and more people could share their feedback. > > > > (cc'ing a few folks that might be interested +Robert Bradshaw > > <[email protected]> +Valentyn Tymofieiev <[email protected]> +Sindy > Li > > <[email protected]> +Brian Hulette <[email protected]> ) > > > > Ahmet > > > > > > On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <[email protected]> wrote: > > > > > To whom may concern, > > > > > > This is Ning from Google. We are currently making efforts to leverage > an > > > interactive runner under python beam sdk. > > > > > > There is already an interactive Beam (iBeam for short) runner with > jupyter > > > notebook in the repo > > > < > https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive > > > > > . > > > Following the instructions on that page, one can set up an interactive > > > environment to develop and execute Beam pipeline interactively. > > > > > > However, there are many issues with existing iBeam. One issue is that > it > > > uses a concept of leaf PCollection to cache and materialize > intermediate > > > PCollection. If the user wants to reuse/introspect a non-leaf > PCollection, > > > the interactive runner will run into errors. > > > > > > Our initial effort will be fixing the existing issues. And we also > want to > > > make iBeam easy to use. Since iBeam uses the same model Beam uses, > there > > > isn't really any difference for users between creating a pipeline with > > > interactive runner and other runners. > > > So we want to minimize the interfaces a user needs to learn while > giving > > > the user some capability to interact with the interactive environment. > > > > > > See this initial PR <https://github.com/apache/beam/pull/9278>, the > > > interactive_beam module will provide mainly 4 interfaces: > > > > > > - For advanced users who define pipeline outside __main__, let them > > > tell current interactive environment where they define their > pipeline: > > > watch() > > > - This is very useful for tests where pipeline can be defined in > > > test methods. > > > - If the user simply creates pipeline in a Jupyter notebook or a > > > plain Python script, they don't have to know/use this feature at > all. > > > - Let users create an interactive pipeline: create_pipeline() > > > - invoking create_pipeline(), the user gets a Pipeline object > that > > > works as any other Pipeline object created from > apache_beam.Pipeline() > > > - However, the pipeline object p, when invoking p.run(), does > some > > > extra interactive magic. > > > - We'll support interactive execution for DirectRunner at this > > > moment. > > > - Let users run the interactive pipeline as a normal pipeline: > > > run_pipeline() > > > - In an interactive environment, a user only needs to add and > > > execute 1 line of code run_pipeline(pipeline) to execute any > existing > > > interactive pipeline object as normal pipeline in any selected > platform. > > > - We'll probably support Dataflow only. Other implementations can > > > be added though. > > > - Let users introspect any intermediate PCollection they have > handler > > > to: visualize() > > > - If a user ever writes pcoll = p | "Some Transform" >> > > > some_transform() ..., they can visualize(pcoll) once the > pipeline p is > > > executed. > > > - p can be batch or streaming > > > - The visualization will be some plot graph of data for the given > > > PCollection as if it's materialized. If the PCollection is > unbounded, the > > > graph is dynamic. > > > > > > The PR will implement 1 and 2. > > > > > > We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top > > > level JIRA and add blocking JIRAs as development goes. > > > > > > External Beam users will not worry about any of the underlying > > > implementation details. > > > Except the 4 interfaces above, they learn and write normal Beam code > and > > > can execute the pipeline immediately when they are done with > prototyping. > > > > > > Ning. > > > > > >
