[
https://issues.apache.org/jira/browse/FLINK-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yurui Zhou updated FLINK-11976:
-------------------------------
Attachment: image-2019-03-25-17-16-43-664.png
> PyFlink vectorized python udf with Pandas support
> -------------------------------------------------
>
> Key: FLINK-11976
> URL: https://issues.apache.org/jira/browse/FLINK-11976
> Project: Flink
> Issue Type: New Feature
> Components: API / Python
> Reporter: Yurui Zhou
> Priority: Major
> Attachments: image-2019-03-25-17-16-43-664.png
>
>
> h2. Motivation
> Currently, the PyFlink allow user to compose Flink data transformation and
> define UDF in python. The PyFlink transform python scripts into operation
> plans and send it over to Java runtime, having the Java runtime execute the
> operations accordingly and return the executed result.
> While encountering Python UDF, the Java runtime create another Python worker,
> serialized the data and have it send over to python worker. The python worker
> processed data in row based manner and send it back to Java runtime.
> There are several limitation with current python udf:
> * Inefficient data movement between Java and Python
> (Serialization/Deserialization)
> * Scalar Computation model
> Goals
> * Enable Pandas support in Flink Python UDF.
> * Enable vectorizied Python UDF execution based on Pandas
> * Using Apache Arrow as the serialization format between Java runtime and
> Python worker
>
> Pandas UDF (vectorized UDF)
> h3. Benefits
> * Provided high performance, easy-to-use data structures and data analysis
> tools for Python.
> * Pandas already provide interface to directly interact with Apache Arrow
> * Enable vectorized computation to fully taking advantage of the Arrow
> Memory layout.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)