[jira] [Closed] (FLINK-11976) PyFlink vectorized python udf with Pandas support

Chesnay Schepler (JIRA) Wed, 20 Mar 2019 03:04:10 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chesnay Schepler closed FLINK-11976.
------------------------------------
    Resolution: Won't Fix

We currently do not accept contributions to the batch/streaming Python APIs.

> PyFlink vectorized python udf with Pandas support
> -------------------------------------------------
>
>                 Key: FLINK-11976
>                 URL: https://issues.apache.org/jira/browse/FLINK-11976
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / Python
>            Reporter: Yurui Zhou
>            Priority: Major
>
> h2. Motivation
> Currently, the PyFlink  allow user to compose Flink data transformation and 
> define UDF in python.  The PyFlink  transform python scripts into operation 
> plans and send it over to Java runtime, having the Java runtime execute the 
> operations accordingly and return the executed result. 
> While encountering Python UDF, the Java runtime create another Python worker, 
> serialized the data and have it send over to python worker. The python worker 
> processed data in row based manner and send it back to Java runtime. 
> How flink python UDF works:
>  
>  
> !https://intranetproxy.alipay.com/skylark/lark/0/2019/png/93219/1551259580271-ebffa0d7-f675-43bf-aa6d-4e94a54b1f10.png!
>   
> There are several limitation with current python udf:
>  * Inefficient data movement between Java and Python 
> (Serialization/Deserialization)
>  * Scalar Computation model
>  Goals
>  * Enable Pandas support in Flink Python UDF.
>  * Enable vectorizied Python UDF execution based on Pandas
>  * Using Apache Arrow as the serialization format between Java runtime and 
> Python worker
>  
> Pandas UDF (vectorized UDF)
> h3. Benefits
>  * Provided high performance, easy-to-use data structures and data analysis 
> tools for Python.
>  * Pandas already provide interface to directly interact with Apache Arrow
>  * Enable vectorized computation to fully taking advantage of the Arrow 
> Memory layout. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Closed] (FLINK-11976) PyFlink vectorized python udf with Pandas support

Reply via email to