[
https://issues.apache.org/jira/browse/FLINK-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800861#comment-16800861
]
Stephan Ewen commented on FLINK-11976:
--------------------------------------
[~Yurui Zhou] Nice to see your interest in Flink and Python.
There is currently a discussion about reworking the Flink Python integration; you
can find parts of it in the following mail thread:
https://lists.apache.org/thread.html/f6f8116b4b38b0b2d70ed45b990d6bb1bcb33611fde6fdf32ec0e840@%3Cdev.flink.apache.org%3E
I believe that there will be a FLIP about this in the near future. [~shaoxuan]
is one of the committers involved in this. Maybe he can add a few more details.
The new Python effort will most likely be focused on the Table API rather than
the DataStream API. That should not be much of a restriction (you can convert
between DataStream and Table), but I would still be curious whether your
application would consider using a Python Table API.
> PyFlink vectorized python udf with Pandas support
> -------------------------------------------------
>
> Key: FLINK-11976
> URL: https://issues.apache.org/jira/browse/FLINK-11976
> Project: Flink
> Issue Type: New Feature
> Components: API / Python
> Reporter: Yurui Zhou
> Priority: Major
> Attachments: image-2019-03-25-17-16-43-664.png
>
>
> h2. Motivation
> Currently, PyFlink allows users to compose Flink data transformations and
> define UDFs in Python. PyFlink translates Python scripts into operation
> plans and sends them to the Java runtime, which executes the operations and
> returns the results.
> When a Python UDF is encountered, the Java runtime creates a separate Python
> worker, serializes the data, and sends it to that worker. The Python worker
> processes the data row by row and sends the results back to the Java runtime.
> How Flink Python UDFs work
> !image-2019-03-25-17-16-43-664.png!
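>
> Below is a minimal schematic (plain Python, not a PyFlink API) of the
> row-at-a-time model described above: each row is deserialized, the UDF is
> invoked once per row, and the result is serialized back to the Java runtime.
> {code:python}
> def my_udf(x):
>     # a user-defined scalar function, called once per row
>     return x * 2 + 1
>
> def process_rows(serialized_rows, deserialize, serialize):
>     # one serialization/deserialization round trip for every row
>     results = []
>     for raw in serialized_rows:
>         row = deserialize(raw)
>         results.append(serialize(my_udf(row)))
>     return results
> {code}
>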
> There are several limitations with the current Python UDF support:
> * Inefficient data movement between Java and Python
> (serialization/deserialization for every row)
> * Row-at-a-time (scalar) computation model
> h3. Goals
> * Enable Pandas support in Flink Python UDFs.
> * Enable vectorized Python UDF execution based on Pandas.
> * Use Apache Arrow as the serialization format between the Java runtime and
> the Python worker (see the sketch after this list).
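>
> As a rough illustration of the Arrow goal (assuming the pyarrow and pandas
> packages; the transport itself is schematic), a whole batch can be encoded
> once in Arrow's IPC format instead of being serialized row by row:
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
>
> # Encode the whole batch once in Arrow's IPC stream format
> # (this is what would cross the Java <-> Python boundary).
> batch = pa.RecordBatch.from_pandas(df)
> sink = pa.BufferOutputStream()
> with pa.ipc.new_stream(sink, batch.schema) as writer:
>     writer.write_batch(batch)
> buf = sink.getvalue()
>
> # Decode on the other side and hand the batch to Pandas in one call.
> reader = pa.ipc.open_stream(buf)
> restored = reader.read_all().to_pandas()
> {code}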
>
> h2. Pandas UDF (vectorized UDF)
> h3. Benefits
> * Provides high-performance, easy-to-use data structures and data analysis
> tools for Python.
> * Pandas already provides an interface to interact directly with Apache Arrow.
> * Enables vectorized computation that takes full advantage of the Arrow
> memory layout (see the sketch below).
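>
> A hypothetical sketch of what a vectorized (Pandas) UDF could look like: the
> function receives whole columns as pandas.Series and is invoked once per
> Arrow batch instead of once per row. The signature below is illustrative
> only, not an existing PyFlink API.
> {code:python}
> import pandas as pd
>
> def scaled_sum(a: pd.Series, b: pd.Series) -> pd.Series:
>     # computed element-wise over the whole batch in one call
>     return (a + b) * 0.5
>
> # Applying it to a batch that was decoded from Arrow into a DataFrame:
> batch_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
> result = scaled_sum(batch_df["a"], batch_df["b"])
> {code}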
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)