Hi Sayan,

1. Using the DataflowRunner as the underlying runner for the InteractiveRunner is not yet officially supported. (It is possible to point the underlying source-recording and PCollection-cache mechanisms at GCS buckets so that they work with the DataflowRunner, but this is not recommended.) The usual workflow is to build the pipeline interactively against a sample of the data with the default DirectRunner, then run it over the full dataset with a DataflowRunner.
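A minimal sketch of that workflow, assuming a notebook environment with the interactive dependencies installed (the sample data and transform are placeholders):

import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

# Prototype against a small in-memory sample; InteractiveRunner wraps the
# DirectRunner by default.
p = beam.Pipeline(InteractiveRunner())
words = p | beam.Create(['a', 'b', 'a'])  # placeholder sample data
counts = words | beam.combiners.Count.PerElement()

# Materialize the PCollection and display it in the notebook.
ib.show(counts)

Once the logic looks right, construct the same pipeline against the full dataset and submit it to Dataflow; the project, region, and bucket below are hypothetical placeholders:

from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(
    runner='DataflowRunner',
    options=PipelineOptions(
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/tmp'))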
2. Beam DataFrames <https://beam.apache.org/documentation/dsls/dataframes/overview/> has been announced. You should be able to use the DataFrame API and convert the results to PCollections with `from apache_beam.dataframe.convert import to_pcollection`; a short sketch follows the quoted message below.

On Tue, Jan 5, 2021 at 8:49 AM Sayan Sanyal <[email protected]> wrote:

> Hello team,
>
> As a user of pyspark, I've been following along with the development of
> Apache Beam with some interest. My interest was specifically piqued when
> I saw the investment in the Dataframe API, as well as the notebook-based
> Interactive Runner.
>
> I had a few questions that I would love to understand better, so any
> pointers would be appreciated.
>
> 1. For the interactive runner
> <https://beam.apache.org/releases/pydoc/2.6.0/_modules/apache_beam/runners/interactive/interactive_runner.html#InteractiveRunner>,
> while the default is the direct runner, are we able to use Dataflow here
> instead? I ask because I would love to interactively process large
> amounts of data that won't fit on my notebook's machine and then inspect
> it. Specifically, I'm trying to replicate this functionality from Spark
> in Beam:
>
> # read some data from GCS that won't fit in memory
> df = spark.read.parquet(...)
>
> # group by and summarize the data; the shuffle is distributed, because
> # otherwise the notebook machine would OOM
> result_df = df.groupby(...).agg(...)
>
> # interactively inspect a random sample of rows from the dataframe;
> # they need not be in order
> result_df.show(...)
>
> 2. Are there any demo notebooks planned that combine the Interactive
> Runner and the Dataframe API? I ask this somewhat leadingly, as I hope
> that, given the large number of interactive-notebook users out there who
> primarily deal in dataframes, this would be a natural audience for you
> to market the APIs to.
>
> I appreciate any discussion and thoughts.
>
> Thanks,
> Sayan
>
> --
>
> Sayan Sanyal
>
> Data Scientist on Notifications
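As promised above, here is a minimal sketch of point 2 in a notebook, mirroring the pyspark snippet in the quoted message. The GCS path and column name are hypothetical, and I've used a plain .sum() aggregation for simplicity:

import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_parquet

p = beam.Pipeline(InteractiveRunner())

# Deferred Beam DataFrame; nothing is read until the pipeline runs.
df = p | read_parquet('gs://my-bucket/data/*.parquet')  # hypothetical path

# Group and aggregate with the pandas-like deferred API.
result_df = df.groupby('key').sum()  # hypothetical column name

# Convert back to a PCollection and inspect a sample in the notebook.
result = to_pcollection(result_df)
ib.show(result)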
