+1 for option (1) Are there any plans to add a support for Spark runner?
> On 24 Sep 2021, at 19:03, Ning Kang <[email protected]> wrote: > > To Kyle, > > Yes, we have examples > <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive/examples> > to run notebooks with DirectRunner and FlinkRunner locally and > DataflowRunner remotely. > Theoretically, should have access to the PCollection cache, any runner can be > used as the underlying runner of the InteractiveRunner since it's only a thin > wrapper. > We plan to add more support in the future such as remotely run notebooks on > FlinkRunner-on-Dataproc. > > To Brian, > Does interactive Beam have a way to set the runner once and have that apply > to future invocations? > Yes, you can set the underlying runner of the InteractiveRunner when you > create the pipeline object. > > You can also build a pipeline with one runner and run it with another as long > as it's portable. For example, if option (1) is implemented, the "-r, > --runner" option allows you to run the pipeline you have built on a selected > runner. If you have chained a few beam_sql magics, ideally, Interactive Beam > should have memorized them all and can run them as a single pipeline on a > different runner. > Usually you use a known dataset to prototype your SQLs locally and inspect > the result quickly, then switch to a production dataset and run it remotely > on your production cluster/service. > > Thanks for the vote! > > > > On Wed, Sep 22, 2021 at 1:05 PM Brian Hulette <[email protected] > <mailto:[email protected]>> wrote: > +1 for option (1). > > Does interactive Beam have a way to set the runner once and have that apply > to future invocations? > > I don't like (2), since it could lead to confusion with separate SQL > solutions offered by runners (e.g. Dataflow SQL [1], Spark SQL [2], Flink SQL > [3]) > > [1] https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-intro > <https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-intro> > [2] https://spark.apache.org/sql/ <https://spark.apache.org/sql/> > [3] > https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/sql/overview/ > > <https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/sql/overview/> > On Wed, Sep 22, 2021 at 11:04 AM Ning Kang <[email protected] > <mailto:[email protected]>> wrote: > Hi, > > TL;DR: If you are not a notebook/IPython or a Beam SQL user/developer, you > may ignore this email. If you are interested in the topic, please continue. > > Recently, we merged an example > <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/examples/Run%20Beam%20SQL%20with%20beam_sql%20magic.ipynb> > notebook utilizing a new `beam_sql` IPython magic > <https://ipython.readthedocs.io/en/stable/config/custommagics.html> (if > you're not familiar with magics, here are some built-in examples > <https://ipython.readthedocs.io/en/stable/interactive/magics.html>) to run > Beam SQL <https://beam.apache.org/documentation/dsls/sql/overview/> (for > simplicity, we didn't expose the pipeline option to switch dialects, so it's > only Calcite SQL) in notebooks with InteractiveRunner (DirectRunner). > Basically, its usage in notebooks is: > <help.png> > And an example query joining 2 PCollections: > <join.png> > > It currently by default runs locally in a notebook instance on the > InteractiveRunner (DirectRunner). We want to allow users to run a pipeline > built from it on other runners too. > Here are a few proposals: > Add another option in the beam_sql magic: -r, --runner specify the runner > name (case/space insensitive) such as a DataflowRunner or a FlinkRunner. > Add a separate magic for each runner: dataflow_sql, flink_sql and etc. > Create other runner instances to run the pipeline: > SomeRunner(options).run_pipeline(bridge_function(sql_output_pcoll.pipeline)) > Any other suggestions? > I personally prefer the 1st option at this moment. Please share your > thoughts, thanks! > > Ning.
