[jira] [Updated] (FLINK-11976) PyFlink vectorized python udf with Pandas support

Yurui Zhou (JIRA) Mon, 25 Mar 2019 02:17:27 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-11976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yurui Zhou updated FLINK-11976:
-------------------------------
    Attachment: image-2019-03-25-17-16-43-664.png

> PyFlink vectorized python udf with Pandas support
> -------------------------------------------------
>
>                 Key: FLINK-11976
>                 URL: https://issues.apache.org/jira/browse/FLINK-11976
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / Python
>            Reporter: Yurui Zhou
>            Priority: Major
>         Attachments: image-2019-03-25-17-16-43-664.png
>
>
> h2. Motivation
> Currently, the PyFlink  allow user to compose Flink data transformation and 
> define UDF in python.  The PyFlink  transform python scripts into operation 
> plans and send it over to Java runtime, having the Java runtime execute the 
> operations accordingly and return the executed result. 
> While encountering Python UDF, the Java runtime create another Python worker, 
> serialized the data and have it send over to python worker. The python worker 
> processed data in row based manner and send it back to Java runtime. 
> There are several limitation with current python udf:
>  * Inefficient data movement between Java and Python 
> (Serialization/Deserialization)
>  * Scalar Computation model
>  Goals
>  * Enable Pandas support in Flink Python UDF.
>  * Enable vectorizied Python UDF execution based on Pandas
>  * Using Apache Arrow as the serialization format between Java runtime and 
> Python worker
>  
> Pandas UDF (vectorized UDF)
> h3. Benefits
>  * Provided high performance, easy-to-use data structures and data analysis 
> tools for Python.
>  * Pandas already provide interface to directly interact with Apache Arrow
>  * Enable vectorized computation to fully taking advantage of the Arrow 
> Memory layout. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (FLINK-11976) PyFlink vectorized python udf with Pandas support

Reply via email to