[
https://issues.apache.org/jira/browse/FLINK-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chesnay Schepler closed FLINK-11976.
------------------------------------
Resolution: Won't Fix
We currently do not accept contributions to the batch/streaming Python APIs.
> PyFlink vectorized python udf with Pandas support
> -------------------------------------------------
>
> Key: FLINK-11976
> URL: https://issues.apache.org/jira/browse/FLINK-11976
> Project: Flink
> Issue Type: New Feature
> Components: API / Python
> Reporter: Yurui Zhou
> Priority: Major
>
> h2. Motivation
> Currently, the PyFlink allow user to compose Flink data transformation and
> define UDF in python. The PyFlink transform python scripts into operation
> plans and send it over to Java runtime, having the Java runtime execute the
> operations accordingly and return the executed result.
> While encountering Python UDF, the Java runtime create another Python worker,
> serialized the data and have it send over to python worker. The python worker
> processed data in row based manner and send it back to Java runtime.
> How flink python UDF works:
>
>
> !https://intranetproxy.alipay.com/skylark/lark/0/2019/png/93219/1551259580271-ebffa0d7-f675-43bf-aa6d-4e94a54b1f10.png!
>
> There are several limitation with current python udf:
> * Inefficient data movement between Java and Python
> (Serialization/Deserialization)
> * Scalar Computation model
> Goals
> * Enable Pandas support in Flink Python UDF.
> * Enable vectorizied Python UDF execution based on Pandas
> * Using Apache Arrow as the serialization format between Java runtime and
> Python worker
>
> Pandas UDF (vectorized UDF)
> h3. Benefits
> * Provided high performance, easy-to-use data structures and data analysis
> tools for Python.
> * Pandas already provide interface to directly interact with Apache Arrow
> * Enable vectorized computation to fully taking advantage of the Arrow
> Memory layout.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)