Exactly, that's what I had in mind.

A Flink runner native transform would make the existing unbounded sources
available, similar to:

https://github.com/apache/beam/blob/2e89c1e4d35e7b5f95a622259d23d921c3d6ad1f/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingTransformTranslators.java#L167
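
For anyone not familiar with that file: the translator there follows a simple registry pattern, where a transform URN is mapped to a translator that builds the runner-native operator. A minimal self-contained Python sketch of that dispatch pattern (all names here are hypothetical stand-ins, not actual Beam or Flink APIs):

```python
# Registry mapping a transform URN to a translator function. In the real
# Flink runner this lives in FlinkStreamingTransformTranslators (Java); this
# is only an illustration of the registration/dispatch idea.
TRANSLATORS = {}

def register(urn):
    """Decorator that registers a translator function under a transform URN."""
    def wrapper(fn):
        TRANSLATORS[urn] = fn
        return fn
    return wrapper

@register("beam:transform:pubsub_read:v1")  # hypothetical URN
def translate_pubsub_read(transform_config):
    # A real translator would instantiate a native Flink source here; we just
    # return a description of what would be built.
    return {"native_source": "FlinkPubsubSource",
            "subscription": transform_config["subscription"]}

def translate(urn, transform_config):
    """Dispatch a pipeline transform to its registered native translator."""
    if urn not in TRANSLATORS:
        raise KeyError("No native translation registered for %s" % urn)
    return TRANSLATORS[urn](transform_config)
```

The point of the pattern is that the runner can intercept selected transforms by URN and substitute a native implementation, without the SDK side needing to change.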

On Thu, Jan 31, 2019 at 8:18 AM Maximilian Michels <m...@apache.org> wrote:

> Wouldn't it be even more useful for the transition period if we enabled Beam
> IO to be used via Flink (like in the legacy Flink Runner)? In this particular
> example, Matthias wants to use PubSubIO, which is not even available as a
> native Flink transform.
>
> On 31.01.19 16:21, Thomas Weise wrote:
> > Until SDF is supported, we could also add Flink runner native transforms
> > for selected unbounded sources [1].
> >
> > That might be a reasonable option to unblock users that want to try Python
> > streaming on Flink.
> >
> > Thomas
> >
> > [1]
> > https://github.com/lyft/beam/blob/release-2.10.0-lyft/runners/flink/src/main/java/org/apache/beam/runners/flink/LyftFlinkStreamingPortableTranslations.java
> >
> >
> > On Thu, Jan 31, 2019 at 6:51 AM Maximilian Michels <m...@apache.org> wrote:
> >
> >      > I have a hard time imagining how we can map RestrictionTrackers in
> >      > a generic way into the existing Bounded/UnboundedSource, so I would
> >      > love to hear more about the details.
> >
> >     Isn't it the other way around? The SDF is a generalization of
> >     UnboundedSource. So we would wrap UnboundedSource using SDF. I'm not
> >     saying it is trivial, but SDF offers all the functionality that
> >     UnboundedSource needs.
> >
> >     For example, the @GetInitialRestriction method would call split on the
> >     UnboundedSource, and the restriction trackers would then be used to
> >     process the splits.
> >
> >     On 31.01.19 15:16, Ismaël Mejía wrote:
> >      >> Not necessarily. This would be one way. Another way is to build an
> >      >> SDF wrapper for UnboundedSource. Probably the easier path for
> >      >> migration.
> >      >
> >      > That would be fantastic. I have heard about such a wrapper multiple
> >      > times, but so far there is not any realistic proposal. I have a hard
> >      > time imagining how we can map RestrictionTrackers in a generic way
> >      > into the existing Bounded/UnboundedSource, so I would love to hear
> >      > more about the details.
> >      >
> >      > On Thu, Jan 31, 2019 at 3:07 PM Maximilian Michels <m...@apache.org> wrote:
> >      >>
> >      >>   > In addition to having support in the runners, this will
> >      >>   > require a rewrite of PubsubIO to use the new SDF API.
> >      >>
> >      >> Not necessarily. This would be one way. Another way is to build an
> >      >> SDF wrapper for UnboundedSource. Probably the easier path for
> >      >> migration.
> >      >>
> >      >> On 31.01.19 14:03, Ismaël Mejía wrote:
> >      >>>> Fortunately, there is already a pending PR for cross-language
> >      >>>> pipelines which will allow us to use Java IO like PubSub in
> >      >>>> Python jobs.
> >      >>>
> >      >>> In addition to having support in the runners, this will require a
> >      >>> rewrite of PubsubIO to use the new SDF API.
> >      >>>
> >      >>> On Thu, Jan 31, 2019 at 12:23 PM Maximilian Michels <m...@apache.org> wrote:
> >      >>>>
> >      >>>> Hi Matthias,
> >      >>>>
> >      >>>> This is already reflected in the compatibility matrix if you
> >      >>>> look under SDF. There is no UnboundedSource interface for
> >      >>>> portable pipelines. That's a legacy abstraction that will be
> >      >>>> replaced with SDF.
> >      >>>>
> >      >>>> Fortunately, there is already a pending PR for cross-language
> >      >>>> pipelines which will allow us to use Java IO like PubSub in
> >      >>>> Python jobs.
> >      >>>>
> >      >>>> Thanks,
> >      >>>> Max
> >      >>>>
> >      >>>> On 31.01.19 12:06, Matthias Baetens wrote:
> >      >>>>> Hey Ankur,
> >      >>>>>
> >      >>>>> Thanks for the swift reply. Should I change this in the
> >      >>>>> capability matrix
> >      >>>>> <https://s.apache.org/apache-beam-portability-support-table> then?
> >      >>>>>
> >      >>>>> Many thanks.
> >      >>>>> Best,
> >      >>>>> Matthias
> >      >>>>>
> >      >>>>> On Thu, 31 Jan 2019 at 09:31, Ankur Goenka <goe...@google.com> wrote:
> >      >>>>>
> >      >>>>>       Hi Matthias,
> >      >>>>>
> >      >>>>>       Unfortunately, unbounded reads, including pubsub, are not
> >      >>>>>       yet supported for portable runners.
> >      >>>>>
> >      >>>>>       Thanks,
> >      >>>>>       Ankur
> >      >>>>>
> >      >>>>>       On Thu, Jan 31, 2019 at 2:44 PM Matthias Baetens
> >      >>>>>       <baetensmatth...@gmail.com> wrote:
> >      >>>>>
> >      >>>>>           Hi everyone,
> >      >>>>>
> >      >>>>>           For the last few days I have been trying to run a
> >      >>>>>           streaming pipeline (code on Github
> >      >>>>>           <https://github.com/matthiasa4/beam-demo>) on a
> >      >>>>>           Flink Runner.
> >      >>>>>
> >      >>>>>           I am running a Flink cluster locally (v1.5.6
> >      >>>>>           <https://flink.apache.org/downloads.html>).
> >      >>>>>           I have built the SDK Harness Container: ./gradlew
> >      >>>>>           :beam-sdks-python-container:docker
> >      >>>>>           and started the JobServer: ./gradlew
> >      >>>>>           :beam-runners-flink_2.11-job-server:runShadow
> >      >>>>>           -PflinkMasterUrl=localhost:8081.
> >      >>>>>
> >      >>>>>           I run my pipeline with: env/bin/python
> >      >>>>>           streaming_pipeline.py --runner=PortableRunner
> >      >>>>>           --job_endpoint=localhost:8099 --output xxx
> >      >>>>>           --input_subscription xxx --output_subscription xxx
> >      >>>>>           All this is running inside Ubuntu (Bionic) in a
> >      >>>>>           VirtualBox.
> >      >>>>>
> >      >>>>>           The job submits fine, but unfortunately fails after
> >      >>>>>           a few seconds with the error attached.
> >      >>>>>
> >      >>>>>           Anything I am missing or doing wrong?
> >      >>>>>
> >      >>>>>           Many thanks.
> >      >>>>>           Best,
> >      >>>>>           Matthias
> >      >>>>>
> >      >>>>>
> >
>
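
For anyone skimming the thread: the SDF-wrapper idea discussed above (where @GetInitialRestriction calls split on the UnboundedSource, and restriction trackers then gate the processing of each split) could look roughly like the following self-contained Python sketch. The class and method names are illustrative stand-ins, not the real Beam SDF or UnboundedSource interfaces.

```python
class UnboundedSourceStub:
    """Stands in for Beam's UnboundedSource; serves a fixed set of records."""
    def __init__(self, records):
        self.records = records

    def split(self, desired_num_splits):
        # Partition records round-robin into at most the desired number of
        # non-empty splits.
        splits = [self.records[i::desired_num_splits]
                  for i in range(desired_num_splits)]
        return [s for s in splits if s]


class OffsetRangeTracker:
    """Toy restriction tracker: claims offsets in [start, end)."""
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.position = start

    def try_claim(self, offset):
        if offset >= self.end:
            return False
        self.position = offset
        return True


class SdfSourceWrapper:
    """SDF-style wrapper whose restrictions are the wrapped source's splits."""
    def __init__(self, source, desired_num_splits=2):
        self.source = source
        self.desired_num_splits = desired_num_splits

    def get_initial_restriction(self):
        # Mirrors the idea that @GetInitialRestriction calls split().
        return self.source.split(self.desired_num_splits)

    def process(self, split, tracker):
        # Emit each record of the split whose offset the tracker lets us claim.
        out = []
        for offset, record in enumerate(split):
            if not tracker.try_claim(offset):
                break
            out.append(record)
        return out
```

The real wrapper would of course also have to carry over checkpointing, watermarks, and deduplication from the UnboundedSource contract, which is where the non-trivial work mentioned above lies.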
