[jira] [Updated] (SPARK-40307) Optimize (De)Serialization of Python UDFs by Arrow

Xinrong Meng (Jira) Tue, 03 Jan 2023 23:21:05 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xinrong Meng updated SPARK-40307:
---------------------------------
    Description: 
A Python user-defined function (UDF) lets users run arbitrary code against 
PySpark columns. It uses Pickle for (de)serialization and executes row by row.

One major performance bottleneck of Python UDFs is (de)serialization, that is, 
the data interchange between the worker JVM and the spawned Python subprocess 
that actually executes the UDF. We should seek an alternative to handle the 
(de)serialization: Arrow, which is already used for the (de)serialization of 
Pandas UDFs.

  was:
Python user-defined function (UDF) enables users to run arbitrary code against 
PySpark columns. It uses Pickle for (de)serialization, and executes row by row.

One major performance bottleneck of Python UDFs is (de)serialization, that is, 
the data interchanging between the worker JVM and the spawned Python subprocess 
which actually executes the UDF. We should seek for an alternative to handle 
the (de)serialization: Arrow, which is used in (de)serialization of Pandas UDF 
already.


> Optimize (De)Serialization of Python UDFs by Arrow
> --------------------------------------------------
>
>                 Key: SPARK-40307
>                 URL: https://issues.apache.org/jira/browse/SPARK-40307
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Priority: Major
>
> A Python user-defined function (UDF) lets users run arbitrary code 
> against PySpark columns. It uses Pickle for (de)serialization and executes 
> row by row.
> One major performance bottleneck of Python UDFs is (de)serialization, that 
> is, the data interchange between the worker JVM and the spawned Python 
> subprocess that actually executes the UDF. We should seek an alternative to 
> handle the (de)serialization: Arrow, which is already used for the 
> (de)serialization of Pandas UDFs.
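
The contrast the description draws, per-row Pickle round trips versus a 
columnar batch, can be illustrated with a minimal standard-library sketch. 
This is not Spark's implementation and does not use Arrow: the `array` module 
stands in for a contiguous column buffer, and `plus_one`, `run_row_by_row`, 
and `run_columnar_batch` are hypothetical names chosen for illustration. It 
only shows how the number of (de)serialization calls differs between the two 
shapes of data exchange.

```python
import pickle
from array import array


def plus_one(x):
    """Hypothetical UDF body: any per-value Python function."""
    return x + 1


def run_row_by_row(values):
    """Pickle-style path: every value makes its own serialize/deserialize trip."""
    results = []
    for v in values:
        payload = pickle.dumps(v)          # sender: one payload per row
        arg = pickle.loads(payload)        # Python worker: decode one row
        out = pickle.dumps(plus_one(arg))  # worker: encode one result
        results.append(pickle.loads(out))  # sender: decode one result
    return results


def run_columnar_batch(values):
    """Columnar stand-in (NOT Arrow): the whole column travels as one buffer."""
    payload = array("q", values).tobytes()  # sender: one payload per batch
    column = array("q")
    column.frombytes(payload)               # worker: decode the batch once
    out = array("q", (plus_one(v) for v in column)).tobytes()
    result = array("q")
    result.frombytes(out)                   # sender: decode the results once
    return result.tolist()


values = list(range(5))
row_results = run_row_by_row(values)
batch_results = run_columnar_batch(values)
```

Both functions return the same values; the row-by-row path performs four 
Pickle calls per value, while the batch path encodes and decodes each 
direction once per batch. In PySpark itself, the existing Arrow-backed path 
mentioned above is exposed through `pyspark.sql.functions.pandas_udf`.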



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
