Hi everyone,
Sorry for joining this discussion late.

Thanks to Xianda Ke for initiating this discussion. I have also enjoyed the
discussions and suggestions from Max, Austin, Thomas, Shaoxuan, and others.

Recently, I have really felt the desire of the community and Flink users for
Python support. Stephan also pointed out in the `Adding a mid-term
roadmap` discussion that the "Table API becomes primary API for analytics
use cases". At the same time, a large number of users in analytics use
cases are accustomed to Python, and a great deal of their tooling has
accumulated in Python libraries.

So I am very interested in participating in the discussion of supporting
Python in Flink. With regard to the three options mentioned so far, it is
very encouraging to leverage Beam's language portability layer on Flink.
For now, we can take a first step in Flink by adding a Python Table API. I
believe that in this process we will gain a deeper understanding of how
Flink can support Python. If we can quickly let users experience a first
version of the Flink Python Table API, we can also receive feedback from
many users and then consider the long-term goal of multi-language support
on Flink.
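
To make this concrete, a first version might let users write something
like the following (a purely hypothetical sketch; the module, class, and
method names are only illustrative, not a design proposal):

    # Hypothetical Flink Python Table API usage (names are illustrative).
    from flink.table import TableEnvironment

    t_env = TableEnvironment.create()

    # The Python calls only build the query plan; translation and
    # optimization would still happen in the existing Java Table API.
    t_env.scan("Orders") \
         .group_by("user") \
         .select("user, amount.sum as total") \
         .insert_into("Result")

    t_env.execute()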

So if you agree, I volunteer to draft a document with the detailed design
and implementation plan for the Python Table API on Flink.

What do you think?

On Thu, Feb 21, 2019 at 10:44 PM, Shaoxuan Wang <wshaox...@gmail.com> wrote:

> Hey guys,
> Thanks for your comments and sorry for the late reply.
> The Beam Python API and the Flink Python Table API describe the
> DAG/pipeline in different manners. We got a chance to communicate with
> Tyler Akidau (from Beam) offline and explained why the Flink Table API
> needs a specific design for Python, rather than purely leveraging the
> Beam portability layer.
>
> In our proposal, most of the Python code is just a DAG/pipeline builder for
> the Table API. The majority of operators run purely in Java; only Python
> UDFs are executed in a Python environment at runtime. This design does not
> affect the development and adoption of the Beam language portability layer
> with the Flink runner. The Flink and Beam communities will still
> collaborate, jointly developing and optimizing the JVM / non-JVM (Python,
> Go) bridge (the data and control connections between different processes)
> to ensure robustness and performance.
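>
> To sketch that split with hypothetical API names (nothing here is final;
> the decorator and environment class are assumptions for illustration):
>
>     from flink.table import TableEnvironment, udf  # hypothetical names
>
>     @udf(result_type="BIGINT")
>     def double(x):
>         # Only this function body would run in a Python process at runtime.
>         return x * 2
>
>     t_env = TableEnvironment.create()
>     # scan/select/insert_into only build the DAG on the Java side; the
>     # scan and sink operators themselves run purely in Java.
>     t_env.scan("Orders").select(double("amount")).insert_into("Doubled")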
>
> Regards,
> Shaoxuan
>
>
> On Fri, Dec 21, 2018 at 1:39 PM Thomas Weise <t...@apache.org> wrote:
>
> > Interest in Python seems on the rise and so this is a good discussion to
> > have :)
> >
> > So far there seems to be agreement that Beam's approach towards Python
> > and other non-JVM language support (language SDK, portability layer,
> > etc.) is the right direction? Specification and execution are native
> > Python, and it does not suffer from the shortcomings of Flink's Jython
> > API and a few other approaches.
> >
> > Overall there already is good alignment between Beam and Flink in
> > concepts and model. There are also a few of us who are active in both
> > communities. The Beam Flink runner has made a lot of progress this year,
> > but work on portability in Beam actually started much before that and
> > was a very big change (originally there was just the Java SDK). Much of
> > the code has been rewritten as part of the effort; that is what
> > implementing a strong multi-language support story took. To have a
> > decent shot at it, the equivalent of much of the Beam portability
> > framework would need to be reinvented in Flink. This would fork
> > resources and divert focus away from things that may be more core to
> > Flink. As you can guess, I am in favor of option (1)!
> >
> > We could take a look at SQL for reference. The Flink community has
> > invested a lot in SQL and there remains a lot of work to do. The Beam
> > community has done the same, and we have two completely separate
> > implementations. When I recently learned more about the Beam SQL work,
> > one of my first questions was whether a joint effort would not lead to
> > better user value. Calcite is common, but isn't there much more that
> > could be shared? Someone had the idea that in such a world Flink could
> > just substitute portions or all of the graph provided by Beam with its
> > own optimized version, but much of the tooling could be the same.
> >
> > IO connectors are another area where much effort is repeated. It takes a
> > very long time to arrive at a solid, production-quality implementation
> > (typically resulting from broad user exposure and running at scale).
> > Currently there is discussion about how connectors can be done much
> > better in both projects: SDF in Beam [1] and FLIP-27 in Flink.
> >
> > It's a trade-off, but more synergy would be great!
> >
> > Thomas
> >
> > [1] https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/
> >
> >
> > On Tue, Dec 18, 2018 at 2:16 PM Austin Bennett <whatwouldausti...@gmail.com> wrote:
> >
> > > Hi Shaoxuan,
> > >
> > > FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay
> > > Area Apache Beam Meetup [1] which included a bit of a vision for how
> > > Beam could better leverage runner-specific optimizations -- as an
> > > example/extension, Beam SQL leveraging Flink-specific SQL
> > > optimizations (to address your point). So that is part of the
> > > eventual roadmap for Beam, and it illustrates how concrete efforts
> > > towards optimizations in the Runner/SDK Harness would likely yield
> > > the desired result of cross-language support (which could be done by
> > > leveraging Beam, while devoting focus to optimizing that processing
> > > on Flink).
> > >
> > > Cheers,
> > > Austin
> > >
> > >
> > > [1] https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/
> > > -- I can post/share videos once available, should someone desire.
> > >
> > > On Fri, Dec 14, 2018 at 6:03 AM Maximilian Michels <m...@apache.org> wrote:
> > >
> > > > Hi Xianda, hi Shaoxuan,
> > > >
> > > > I'd be in favor of option (1). There is great potential in Beam and
> > > > Flink joining forces on this one. Here's why:
> > > >
> > > > The Beam project spent at least a year developing a portability
> > > > layer, with a reasonable number of people working on it. Developing
> > > > a new portability layer from scratch will probably take about the
> > > > same amount of time and resources.
> > > >
> > > > Concerning option (2): There is already a Python API for Flink, but
> > > > an API is only one part of the portability story. In Beam the
> > > > portability work is structured into three components (see the sketch
> > > > after this list):
> > > >
> > > > - SDK (API, its Protobuf serialization, and interaction with the SDK
> > > > Harness)
> > > > - Runner (translation from the Protobuf pipeline to a Flink job)
> > > > - SDK Harness (UDF execution, interaction with the SDK and the
> > > > execution engine)
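> > > >
> > > > Roughly, a minimal Beam Python pipeline on the Flink runner shows
> > > > where each component comes in (a sketch; the job endpoint and the
> > > > options are assumptions about a local portable setup):
> > > >
> > > >     import apache_beam as beam
> > > >     from apache_beam.options.pipeline_options import PipelineOptions
> > > >
> > > >     # SDK: builds the pipeline and serializes it to Protobuf.
> > > >     options = PipelineOptions([
> > > >         "--runner=PortableRunner",
> > > >         "--job_endpoint=localhost:8099",  # assumed local job server
> > > >     ])
> > > >     with beam.Pipeline(options=options) as p:
> > > >         (p
> > > >          | beam.Create(["a", "b", "a"])
> > > >          | beam.Map(lambda w: (w, 1))  # UDF: runs in the SDK Harness
> > > >          | beam.CombinePerKey(sum))
> > > >     # Runner: the Flink runner translates the Protobuf pipeline into
> > > >     # a Flink job; the UDFs are shipped to the SDK Harness.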
> > > >
> > > > I could imagine the Flink Python API would be another SDK, which
> > > > could have its own API but would reuse code for the interaction with
> > > > the SDK Harness.
> > > >
> > > > We would be able to focus on the optimizations instead of rebuilding
> > > > a portability layer from scratch.
> > > >
> > > > Thanks,
> > > > Max
> > > >
> > > > On 13.12.18 11:52, Shaoxuan Wang wrote:
> > > > > RE: Stephan's options (
> > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html
> > > > > )
> > > > > * Option (1): Language portability via Apache Beam
> > > > > * Option (2): Implement own Python API
> > > > > * Option (3): Implement own portability layer
> > > > >
> > > > > Hi Stephan,
> > > > > Eventually, I think we should support both option 1 and option 3.
> > > > > IMO, these two options are orthogonal. I agree with you that we can
> > > > > leverage the existing work and ecosystem in Beam by supporting
> > > > > option 1. But the problem with Beam is that it skips (to the best
> > > > > of my knowledge) the natural table/SQL optimization framework
> > > > > provided by Flink. We should spend all the needed effort to support
> > > > > option 1 (as it is the better alternative to the current Flink
> > > > > Python API), but we cannot solely bet on it. Option 3 is the ideal
> > > > > choice for Flink to support all non-JVM languages, and we should
> > > > > plan to achieve it. We have done some preliminary prototypes for
> > > > > options 2 and 3, and they do not seem very complex or difficult to
> > > > > accomplish.
> > > > >
> > > > > Regards,
> > > > > Shaoxuan
> > > > >
> > > > >
> > > > > On Thu, Dec 13, 2018 at 4:58 PM Xianda Ke <kexia...@gmail.com> wrote:
> > > > >
> > > > >> Currently there is an ongoing survey about Python usage of Flink
> > > > >> [1]. Some discussion was also brought up there regarding non-JVM
> > > > >> language support strategy in general. To avoid polluting the
> > > > >> survey thread, we are starting this discussion thread and would
> > > > >> like to move the discussions here.
> > > > >>
> > > > >> In the interest of facilitating the discussion, we would like to
> > > > >> first share the following design doc, which describes what we have
> > > > >> done at Alibaba about the Python API for Flink. It could serve as
> > > > >> a good reference for the discussion:
> > > > >>
> > > > >> [DISCUSS] Flink Python API
> > > > >> <https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web>
> > > > >>
> > > > >> As of now, we've implemented and delivered Python UDFs for SQL to
> > > > >> the internal users at Alibaba. We are now starting to implement
> > > > >> the Python API.
> > > > >>
> > > > >> To recap and continue the discussion from the survey thread, I
> > > > >> agree with @Stephan that we should figure out in which general
> > > > >> direction Python support should go. Stephan also listed three
> > > > >> options there:
> > > > >> * Option (1): Language portability via Apache Beam
> > > > >> * Option (2): Implement own Python API
> > > > >> * Option (3): Implement own portability layer
> > > > >>
> > > > >> From my perspective,
> > > > >> (1). Flink's language APIs and Beam's language support are not
> > > > >> mutually exclusive. It is nice that Beam has Python/NodeJS/Go APIs
> > > > >> and supports Flink as a runner. Flink's own Python (or NodeJS/Go)
> > > > >> APIs will benefit Flink's ecosystem.
> > > > >>
> > > > >> (2). Python API / portability layer
> > > > >> To support non-JVM languages in Flink:
> > > > >> * On the client side, Flink would provide language interfaces that
> > > > >> translate the user's application into a Flink StreamGraph.
> > > > >> * On the server side, Flink would execute the user's UDF code at
> > > > >> runtime.
> > > > >> The non-JVM languages communicate with the JVM via RPC (or
> > > > >> low-level sockets, an embedded interpreter, and so on). What the
> > > > >> portability layer can do, perhaps, is abstract this RPC layer.
> > > > >> Even when the portability layer is ready, there is still a lot of
> > > > >> work to do for a specific language. For Python, say, we may still
> > > > >> have to write the interface classes by hand for the users, because
> > > > >> generated code without detailed documentation is unacceptable, and
> > > > >> we have to handle the serialization of lambdas/closures, which is
> > > > >> not a built-in feature in Python. Maybe we can start with the
> > > > >> Python API, then extend to other languages and abstract the common
> > > > >> logic into the portability layer.
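> > > > >>
> > > > >> For example, Python's built-in pickle cannot serialize a lambda
> > > > >> at all, so any Python SDK has to bring its own function serializer
> > > > >> (Beam, as far as I know, uses dill for this):
> > > > >>
> > > > >>     import pickle
> > > > >>
> > > > >>     # Raises pickle.PicklingError: pickle stores functions by
> > > > >>     # their importable name, and a lambda has none.
> > > > >>     pickle.dumps(lambda x: x + 1)
> > > > >>
> > > > >> Libraries like dill or cloudpickle serialize the function's code
> > > > >> object instead, but that is exactly the kind of per-language work
> > > > >> that a generic portability layer does not cover by itself.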
> > > > >>
> > > > >> ---
> > > > >> References:
> > > > >> [1] [SURVEY] Usage of flink-python and flink-streaming-python
> > > > >>
> > > > >> Regards,
> > > > >> Xianda
> > > > >>
> > > > >
> > > >
> > >
> >
>
