[jira] [Assigned] (SPARK-30153) Extend data exchange options for vectorized UDF functions with vanilla Arrow serialization

Apache Spark (Jira) Thu, 30 Apr 2020 09:49:25 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-30153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Apache Spark reassigned SPARK-30153:
------------------------------------

    Assignee:     (was: Apache Spark)

> Extend data exchange options for vectorized UDF functions with vanilla Arrow 
> serialization
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30153
>                 URL: https://issues.apache.org/jira/browse/SPARK-30153
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Luca Canali
>            Priority: Minor
>         Attachments: Flamegraph_test_pandas_udf_SCALAR.png, 
> Flamegraph_test_pandas_udf_SCALAR_ARROW.png
>
>
> Spark has introduced vectorized UDF with pandas_udf and this provides 
> considerable speed up by reducing the overhead due to serialization and 
> deserialization, where applciable.
> The current implementation of pandas_udf uses Arrow for fast serialization 
> and then Pandas Series (or Pandas DF) for processing.
> There are opportunities to improve UDF performance, in certain cases, by 
> bypaasing the conversion to and from Pandas and using Arrow Tables, directly 
> with the help of specialized libraries able to process Arrow Tables and 
> Arrays.
> One such case is for scientific computing of high energy physics data, where 
> processing of arrays of data is of key importance.
> A test case using such approach has shown an increase of performance of about 
> 3x, compared to the equivalent processing with pandas_udf, for a UDF based on 
> plain Arrow serialization using a custom-developed extension of pandas_udf.  
> Processing of Arrow data in the test case was done via the "awkward arrays" 
> library (https://github.com/scikit-hep/awkward-array).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Assigned] (SPARK-30153) Extend data exchange options for vectorized UDF functions with vanilla Arrow serialization

Reply via email to