Hi Ning!
Thanks for the design doc and the explanations.

I think I can grasp some of the concepts, but it is not yet 100% clear to
me why it's necessary to define a new abstraction to have interactivity.
Could you elaborate? Perhaps as a section in the doc? :)

A lot of the motivation for this doc seems related to how we decide which
PCollections to cache - so as to avoid rerunning parts of a pipeline
whenever a user decides to visualize specific parts. I think that makes
sense (and probably helps to have interactivity on streaming).
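
To make the caching concern concrete, here is a toy sketch in plain Python.
It is not Beam code and not what the doc or PR implements: ToyPipeline,
its step names, and the "cache only leaves" vs. "cache what the user may
inspect" policies are all hypothetical stand-ins used only for illustration.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Step = Tuple[Optional[str], Callable[[List[int]], List[int]]]


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: named steps, each with one parent."""

    def __init__(self) -> None:
        self.steps: Dict[str, Step] = {}
        self.cache: Dict[str, List[int]] = {}
        self.executions: Dict[str, int] = {}  # how many times each step ran

    def add(self, name: str, parent: Optional[str],
            fn: Callable[[List[int]], List[int]]) -> None:
        self.steps[name] = (parent, fn)

    def leaves(self) -> Set[str]:
        """Steps that no other step consumes."""
        parents = {p for p, _ in self.steps.values() if p is not None}
        return set(self.steps) - parents

    def compute(self, name: str, to_cache: Set[str]) -> List[int]:
        """Compute a step, reusing cached results and caching only `to_cache`."""
        if name in self.cache:
            return self.cache[name]
        parent, fn = self.steps[name]
        upstream = [] if parent is None else self.compute(parent, to_cache)
        self.executions[name] = self.executions.get(name, 0) + 1
        result = fn(upstream)
        if name in to_cache:
            self.cache[name] = result
        return result


def build() -> ToyPipeline:
    p = ToyPipeline()
    p.add("read", None, lambda _: [1, 2, 3])
    p.add("square", "read", lambda xs: [x * x for x in xs])
    p.add("total", "square", lambda xs: [sum(xs)])
    return p


# Policy A: cache only leaf outputs. Inspecting the non-leaf "square"
# afterwards forces "read" and "square" to run a second time.
leaf_only = build()
leaf_only.compute("total", leaf_only.leaves())
leaf_only.compute("square", leaf_only.leaves())

# Policy B: cache every step the user might inspect. Nothing is rerun.
watched = build()
targets = {"square", "total"}
watched.compute("total", targets)
watched.compute("square", targets)

print(leaf_only.executions)
print(watched.executions)
```

The execution counters show the difference between the two policies:
with leaf-only caching the upstream steps run again for every inspection
of an intermediate result.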

I agree that it's a little odd that InteractiveRunner receives an
underlying runner. That certainly suggests that the functionality is
orthogonal.

So, in short: I think my feedback is similar to others': can you justify
further (or reconsider) why pipeline creation and execution need to be
different?

I can see the need for watch(). Can you also tell us more about how a user
would use visualize()? Do they specify the kind of plot they want?
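
For reference, one way to picture what watch() might do is sketched below
in plain Python. This is only an assumption for discussion, not the
behavior of the PR: ToyEnvironment, its watch() and name_of() methods, and
the idea of scanning a namespace by object identity are all hypothetical.

```python
from typing import Any, Dict, List, Optional


class ToyEnvironment:
    """Hypothetical environment that remembers namespaces it was told to watch."""

    def __init__(self) -> None:
        self._watched: List[Dict[str, Any]] = []

    def watch(self, namespace: Dict[str, Any]) -> None:
        # Store a snapshot of the variables defined where the pipeline was built.
        self._watched.append(dict(namespace))

    def name_of(self, obj: Any) -> Optional[str]:
        """Return the variable name bound to `obj` in a watched namespace, if any."""
        for namespace in self._watched:
            for name, value in namespace.items():
                if value is obj:
                    return name
        return None


env = ToyEnvironment()


def define_pipeline_in_function() -> object:
    # Stand-in for a PCollection defined outside __main__, e.g. in a test method.
    pcoll = object()
    env.watch(locals())
    return pcoll


handle = define_pipeline_in_function()
print(env.name_of(handle))
```

In this picture, anything defined at the top level of a notebook would be
discoverable without an explicit call, and watch() only matters for
variables that live in some other scope.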

Thanks!
-P.

On Wed, Aug 14, 2019 at 12:03 PM Ning Kang <[email protected]> wrote:

> Q1:
> The document is shared (
> https://docs.google.com/document/d/1DYWrT6GL_qDCXhRMoxpjinlVAfHeVilK5Mtf8gO6zxQ/edit?usp=sharing).
> If you are inside Google, use the short link (go/ibeam-external
> <https://goto.google.com/ibeam-external>). I cannot share internal
> documents, but you can reach out if you need the internal engineering plan.
>
> Q2:
> Yes, watch() is an optimization for using visualize() and for building
> further on the pipeline. The user doesn't need to call it if they
> simply define the pipeline in the notebook.
>
> Q3 and Q4:
> I'm only focusing on the direct runner as the underlying runner. We'll get
> rid of much of the existing interactive Beam implementation. We can't
> provide portability for interactivity. Users can still run the pipeline
> with other runners, though, thanks to pipeline portability.
> Our work is to reduce the number of new concepts a user needs to know when
> they want to run interactive Beam. The implementation could be arbitrarily
> complicated and open sourced though. Currently, the interactive runner
> looks as if it supports all kinds of underlying runners. We want to get
> rid of that too.
>
> On 2019/08/08 00:01:06, Ahmet Altay <[email protected]> wrote:
> > Ning, thank you for the heads up.
> >
> > All, this is proposed work for improving the interactive Beam
> > experience. As mentioned in Ning's email, new concepts are being
> > introduced. In addition, iBeam is used as a new name. I hope that
> > bringing the discussion to the mailing list will give it additional
> > visibility so that more people can share their feedback.
> >
> > (cc'ing a few folks that might be interested: +Robert Bradshaw
> > <[email protected]> +Valentyn Tymofieiev <[email protected]>
> > +Sindy Li <[email protected]> +Brian Hulette <[email protected]>)
> >
> > Ahmet
> >
> >
> > On Wed, Aug 7, 2019 at 12:36 PM Ning Kang <[email protected]> wrote:
> >
> > > To whom it may concern,
> > >
> > > This is Ning from Google. We are currently working to leverage an
> > > interactive runner in the Beam Python SDK.
> > >
> > > There is already an interactive Beam (iBeam for short) runner for
> > > Jupyter notebooks in the repo
> > > <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive>.
> > > Following the instructions on that page, one can set up an interactive
> > > environment to develop and execute Beam pipelines interactively.
> > >
> > > However, there are many issues with the existing iBeam. One issue is
> > > that it uses the concept of a leaf PCollection to cache and
> > > materialize intermediate PCollections. If the user wants to reuse or
> > > introspect a non-leaf PCollection, the interactive runner will run
> > > into errors.
> > >
> > > Our initial effort will be fixing the existing issues. We also want
> > > to make iBeam easy to use. Since iBeam uses the same model Beam uses,
> > > there isn't really any difference for users between creating a
> > > pipeline with the interactive runner and with other runners.
> > > So we want to minimize the interfaces a user needs to learn while
> > > giving the user some capability to interact with the interactive
> > > environment.
> > >
> > > See this initial PR <https://github.com/apache/beam/pull/9278>. The
> > > interactive_beam module will mainly provide 4 interfaces:
> > >
> > >    - For advanced users who define pipelines outside __main__, let
> > >    them tell the current interactive environment where they define
> > >    their pipeline: watch()
> > >       - This is very useful for tests where pipelines can be defined
> > >       in test methods.
> > >       - If the user simply creates a pipeline in a Jupyter notebook or
> > >       a plain Python script, they don't have to know or use this
> > >       feature at all.
> > >    - Let users create an interactive pipeline: create_pipeline()
> > >       - Invoking create_pipeline(), the user gets a Pipeline object
> > >       that works like any other Pipeline object created from
> > >       apache_beam.Pipeline().
> > >       - However, the pipeline object p, when invoking p.run(), does
> > >       some extra interactive magic.
> > >       - We'll support interactive execution for DirectRunner at this
> > >       moment.
> > >    - Let users run the interactive pipeline as a normal pipeline:
> > >    run_pipeline()
> > >       - In an interactive environment, a user only needs to add and
> > >       execute 1 line of code, run_pipeline(pipeline), to execute any
> > >       existing interactive pipeline object as a normal pipeline on any
> > >       selected platform.
> > >       - We'll probably support Dataflow only. Other implementations
> > >       can be added though.
> > >    - Let users introspect any intermediate PCollection they have a
> > >    handle to: visualize()
> > >       - If a user ever writes pcoll = p | "Some Transform" >>
> > >       some_transform() ..., they can visualize(pcoll) once the
> > >       pipeline p is executed.
> > >       - p can be batch or streaming.
> > >       - The visualization will be a plot of the data for the given
> > >       PCollection as if it were materialized. If the PCollection is
> > >       unbounded, the graph is dynamic.
> > >
> > > The PR will implement 1 and 2.
> > >
> > > We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top
> > > level JIRA and add blocking JIRAs as development goes.
> > >
> > > External Beam users will not need to worry about any of the
> > > underlying implementation details.
> > > Except for the 4 interfaces above, they learn and write normal Beam
> > > code and can execute the pipeline immediately when they are done
> > > with prototyping.
> > >
> > > Ning.
> > >
> >
>
