Re: [VOTE] naming/usage when running Beam SQL on different runners in notebooks

Alexey Romanenko Fri, 24 Sep 2021 10:11:28 -0700

+1 for option (1)

Are there any plans to add support for the Spark runner?


> On 24 Sep 2021, at 19:03, Ning Kang <[email protected]> wrote:
> 
> To Kyle,
> 
> Yes, we have examples 
> <https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive/examples>
>  to run notebooks with DirectRunner and FlinkRunner locally and 
> DataflowRunner remotely.
> Theoretically, as long as it has access to the PCollection cache, any runner 
> can be used as the underlying runner of the InteractiveRunner, since it's 
> only a thin wrapper.
> We plan to add more support in the future, such as running notebooks remotely 
> on FlinkRunner-on-Dataproc.
> 
> To Brian,
> > Does interactive Beam have a way to set the runner once and have that apply 
> > to future invocations?
> Yes, you can set the underlying runner of the InteractiveRunner when you 
> create the pipeline object.
> 
> You can also build a pipeline with one runner and run it with another as long 
> as it's portable. For example, if option (1) is implemented, the "-r, 
> --runner" option allows you to run the pipeline you have built on a selected 
> runner. If you have chained a few beam_sql magics, Interactive Beam should 
> ideally have memorized them all and be able to run them as a single pipeline 
> on a different runner.
> Usually you use a known dataset to prototype your SQL queries locally and 
> inspect the results quickly, then switch to a production dataset and run the 
> queries remotely on your production cluster/service.
> 
> Thanks for the vote!
> 
> 
> 
> On Wed, Sep 22, 2021 at 1:05 PM Brian Hulette <[email protected]> wrote:
> +1 for option (1).
> 
> Does interactive Beam have a way to set the runner once and have that apply 
> to future invocations?
> 
> I don't like (2), since it could lead to confusion with separate SQL 
> solutions offered by runners (e.g. Dataflow SQL [1], Spark SQL [2], Flink SQL 
> [3])
> 
> [1] https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-intro
> [2] https://spark.apache.org/sql/
> [3] https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/sql/overview/
> 
> On Wed, Sep 22, 2021 at 11:04 AM Ning Kang <[email protected]> wrote:
> Hi,
> 
> TL;DR: If you are not a notebook/IPython or a Beam SQL user/developer, you 
> may ignore this email. If you are interested in the topic, please continue.
> 
> Recently, we merged an example 
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/examples/Run%20Beam%20SQL%20with%20beam_sql%20magic.ipynb>
>  notebook utilizing a new `beam_sql` IPython magic 
> <https://ipython.readthedocs.io/en/stable/config/custommagics.html> (if 
> you're not familiar with magics, here are some built-in examples 
> <https://ipython.readthedocs.io/en/stable/interactive/magics.html>) to run 
> Beam SQL <https://beam.apache.org/documentation/dsls/sql/overview/> (for 
> simplicity, we didn't expose the pipeline option to switch dialects, so it's 
> only Calcite SQL) in notebooks with InteractiveRunner (DirectRunner).
> Basically, its usage in notebooks is:
> <help.png>
> And an example query joining 2 PCollections:
> <join.png>
> 
> By default, it currently runs locally in a notebook instance on the 
> InteractiveRunner (DirectRunner). We want to allow users to run a pipeline 
> built with it on other runners too.
> Here are a few proposals:
> 1. Add another option in the beam_sql magic: -r, --runner, to specify the 
>    runner name (case/space insensitive), such as DataflowRunner or 
>    FlinkRunner.
> 2. Add a separate magic for each runner: dataflow_sql, flink_sql, etc.
> 3. Create other runner instances to run the pipeline: 
>    SomeRunner(options).run_pipeline(bridge_function(sql_output_pcoll.pipeline))
> 4. Any other suggestions?
> I personally prefer the 1st option at this moment. Please share your 
> thoughts, thanks!
> 
> Ning.
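
Option (1) above describes a "-r, --runner" flag whose value is matched case/space insensitively. A minimal sketch of how such matching might work, using only the Python standard library, is shown here. The names KNOWN_RUNNERS, normalize_runner and build_parser are hypothetical and used purely for illustration; this is not the actual beam_sql implementation, and the DirectRunner default is an assumption based on the thread.

```python
import argparse

# Hypothetical lookup table: normalized key -> canonical runner name.
# The three runners listed are the ones mentioned in this thread.
KNOWN_RUNNERS = {
    "directrunner": "DirectRunner",
    "flinkrunner": "FlinkRunner",
    "dataflowrunner": "DataflowRunner",
}


def normalize_runner(name: str) -> str:
    """Return the canonical runner name, ignoring case and whitespace."""
    key = "".join(name.split()).lower()  # drop all whitespace, then lowercase
    if key not in KNOWN_RUNNERS:
        raise ValueError(f"unknown runner: {name!r}")
    return KNOWN_RUNNERS[key]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the proposed -r/--runner option."""
    parser = argparse.ArgumentParser(prog="beam_sql")
    parser.add_argument(
        "-r",
        "--runner",
        type=normalize_runner,   # argparse reports a ValueError as a usage error
        default="DirectRunner",  # assumed default, matching the local behavior
        help="runner name, case/space insensitive",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-r", "dataflow runner"])
    print(args.runner)
```

With this approach, "FlinkRunner", "flink runner" and "FLINKRUNNER" would all select the same runner, which keeps a single magic while avoiding the per-runner magic names of option (2).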
