[
https://issues.apache.org/jira/browse/SPARK-48493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851794#comment-17851794
]
Luca Canali commented on SPARK-48493:
-------------------------------------
This work appears to be related to https://issues.apache.org/jira/browse/SPARK-48220
> Enhance Python Datasource Reader with Arrow Batch Support for Improved
> Performance
> ----------------------------------------------------------------------------------
>
> Key: SPARK-48493
> URL: https://issues.apache.org/jira/browse/SPARK-48493
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Luca Canali
> Priority: Minor
> Labels: pull-request-available
>
> This enhancement adds an option to the Python Datasource Reader to yield
> Arrow batches directly, significantly boosting performance compared to using
> tuples or Rows. This implementation leverages the existing work with
> MapInArrow (see SPARK-46253).
> Tests with a custom Python Datasource for High Energy Physics (HEP) data
> using the ROOT format reader showed an 8x speed increase when using Arrow
> batches over the traditional method of feeding data via tuples.
> Additional context:
> * The ROOT data format is widely used in High Energy Physics (HEP) with
> approximately 1 exabyte of ROOT data currently in existence.
> * You can easily read ROOT data using libraries from the Python ecosystem,
> notably {{uproot}} and {{awkward-array}}. These libraries facilitate the
> reading of ROOT data and its conversion to Arrow among other formats.
> * You can write a simple ROOT data source using the Python datasource API.
> While this may not be optimal for performance, it is easy to implement and
> can leverage the mentioned libraries.
> * For better performance, ingest data via Arrow batches rather than row by
> row, as the latter method is significantly slower (an initial test showed it
> to be 8 times slower).
> * Arrow is now widely used, so this enhancement can also benefit
> communities beyond HEP that rely on Arrow for efficient data processing.
> This enhancement will provide substantial performance improvements and make
> it easier to work with HEP data and other data types using Apache Spark.