[jira] [Commented] (SPARK-48493) Enhance Python Datasource Reader with Arrow Batch Support for Improved Performance

Luca Canali (Jira) Mon, 03 Jun 2024 12:42:07 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-48493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851794#comment-17851794
 ]


Luca Canali commented on SPARK-48493:
-------------------------------------

This work appears to be related to SPARK-48220: https://issues.apache.org/jira/browse/SPARK-48220

 

> Enhance Python Datasource Reader with Arrow Batch Support for Improved 
> Performance
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-48493
>                 URL: https://issues.apache.org/jira/browse/SPARK-48493
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 4.0.0
>            Reporter: Luca Canali
>            Priority: Minor
>              Labels: pull-request-available
>
> This enhancement adds an option to the Python Datasource Reader to yield 
> Arrow batches directly, significantly boosting performance compared to using 
> tuples or Rows. This implementation leverages the existing work with 
> MapInArrow (see SPARK-46253).
> Tests with a custom Python Datasource for High Energy Physics (HEP) data 
> using the ROOT format reader showed an 8x speed increase when using Arrow 
> batches over the traditional method of feeding data via tuples.
> Additional context:
>  * The ROOT data format is widely used in High Energy Physics (HEP) with 
> approximately 1 exabyte of ROOT data currently in existence.
>  * You can easily read ROOT data using libraries from the Python ecosystem, 
> notably {{uproot}} and {{awkward-array}}. These libraries facilitate the 
> reading of ROOT data and its conversion to Arrow, among other formats.
>  * You can write a simple ROOT data source using the Python datasource API. 
> While this may not be optimal for performance, it is easy to implement and 
> can leverage the mentioned libraries.
>  * For better performance, ingest data via Arrow batches rather than row by 
> row, as the latter method is significantly slower (an initial test showed it 
> to be 8 times slower).
>  * Arrow is very popular now, and this enhancement can benefit other 
> communities beyond HEP that use Arrow for efficient data processing.
> This enhancement will provide substantial performance improvements and make 
> it easier to work with HEP data and other data types using Apache Spark.
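
To illustrate the difference described above, here is a minimal sketch contrasting the two ways a reader can feed data: one tuple per row versus column-wise batches. All names (rows_from_columns, column_chunks, arrow_batches, SketchReader) are hypothetical and not taken from the pull request. The sketch assumes the batches are pyarrow.RecordBatch objects and that, in PySpark, the reader would subclass pyspark.sql.datasource.DataSourceReader; the in-memory dict of columns stands in for data that a library such as uproot would supply.

```python
from typing import Dict, Iterator, List, Tuple


def rows_from_columns(columns: Dict[str, List]) -> Iterator[Tuple]:
    """Traditional path: yield one Python tuple per row."""
    yield from zip(*columns.values())


def column_chunks(columns: Dict[str, List], batch_size: int) -> Iterator[Dict[str, List]]:
    """Batch path: yield column-wise slices of at most batch_size rows."""
    n_rows = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, n_rows, batch_size):
        yield {name: values[start:start + batch_size]
               for name, values in columns.items()}


def arrow_batches(columns: Dict[str, List], batch_size: int):
    """Convert each chunk into a pyarrow.RecordBatch (requires pyarrow)."""
    import pyarrow as pa  # imported lazily so the helpers above stay stdlib-only
    for chunk in column_chunks(columns, batch_size):
        yield pa.RecordBatch.from_pydict(chunk)


class SketchReader:
    """Hypothetical reader; in PySpark this is assumed to subclass
    pyspark.sql.datasource.DataSourceReader."""

    def __init__(self, columns: Dict[str, List], batch_size: int = 1000):
        self.columns = columns
        self.batch_size = batch_size

    def read(self, partition):
        # Yield Arrow batches instead of one tuple per row.
        yield from arrow_batches(self.columns, self.batch_size)
```

With the row path, every value crosses the Python/JVM boundary as part of a per-row tuple; with the batch path, each yielded object carries many rows in columnar form, which is where the reported speedup is expected to come from.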



--
This message was sent by Atlassian Jira
(v8.20.10#820010)