Hi Sayan,

1. Using the DataflowRunner as the underlying runner for the InteractiveRunner is not yet officially supported. (It is possible to point the underlying source-recording and PCollection-cache mechanisms at GCS buckets so that they work with the DataflowRunner, but this is not recommended.) The usual workflow is to build the pipeline interactively against a sample of the data with the default DirectRunner, then run it over the full dataset with a DataflowRunner.
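A minimal sketch of that workflow, assuming a notebook environment with the interactive dependencies installed (the sample data and transform are placeholders):

import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

# Prototype against a small in-memory sample; InteractiveRunner wraps the
# DirectRunner by default.
p = beam.Pipeline(InteractiveRunner())
words = p | beam.Create(['a', 'b', 'a'])  # placeholder sample data
counts = words | beam.combiners.Count.PerElement()

# Materialize the PCollection and display it in the notebook.
ib.show(counts)

Once the logic looks right, construct the same pipeline against the full dataset and submit it to Dataflow; the project, region, and bucket below are hypothetical placeholders:

from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(
    runner='DataflowRunner',
    options=PipelineOptions(
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/tmp'))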
2. Beam DataFrames <https://beam.apache.org/documentation/dsls/dataframes/overview/> has been announced. You should be able to use the DataFrame API and convert the results to PCollections with `from apache_beam.dataframe.convert import to_pcollection`; a short sketch follows the quoted message below.

On Tue, Jan 5, 2021 at 8:49 AM Sayan Sanyal <[email protected]> wrote:

> Hello team,
>
> As a user of pyspark, I've been following along with the development of
> Apache Beam with some interest. My interest was specifically piqued when
> I saw the investment in the Dataframe API, as well as the notebook-based
> Interactive Runner.
>
> I had a few questions that I would love to understand better, so any
> pointers would be appreciated.
>
> 1. For the interactive runner
> <https://beam.apache.org/releases/pydoc/2.6.0/_modules/apache_beam/runners/interactive/interactive_runner.html#InteractiveRunner>,
> while the default is the direct runner, are we able to use Dataflow here
> instead? I ask because I would love to interactively process large
> amounts of data that won't fit on my notebook's machine and then inspect
> it. Specifically, I'm trying to replicate this functionality from Spark
> in Beam:
>
> # read some data from GCS that won't fit in memory
> df = spark.read.parquet(...)
>
> # group by and summarize the data; the shuffle is distributed, because
> # otherwise the notebook machine would OOM
> result_df = df.groupby(...).agg(...)
>
> # interactively inspect a random sample of rows from the dataframe;
> # they need not be in order
> result_df.show(...)
>
> 2. Are there any demo notebooks planned that combine the Interactive
> Runner and the Dataframe API? I ask this somewhat leadingly, as I hope
> that, given the large number of interactive-notebook users out there who
> primarily deal in dataframes, this would be a natural audience for you
> to market the APIs to.
>
> I appreciate any discussion and thoughts.
>
> Thanks,
> Sayan
>
> --
>
> Sayan Sanyal
>
> Data Scientist on Notifications
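As promised above, here is a minimal sketch of point 2 in a notebook, mirroring the pyspark snippet in the quoted message. The GCS path and column name are hypothetical, and I've used a plain .sum() aggregation for simplicity:

import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_parquet

p = beam.Pipeline(InteractiveRunner())

# Deferred Beam DataFrame; nothing is read until the pipeline runs.
df = p | read_parquet('gs://my-bucket/data/*.parquet')  # hypothetical path

# Group and aggregate with the pandas-like deferred API.
result_df = df.groupby('key').sum()  # hypothetical column name

# Convert back to a PCollection and inspect a sample in the notebook.
result = to_pcollection(result_df)
ib.show(result)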
