damccorm opened a new issue, #21657:
URL: https://github.com/apache/beam/issues/21657

   With https://s.apache.org/batched-dofns (BEAM-14213), we want to encourage 
users to develop pipelines that process Arrow data within the Python SDK, but 
communicating batches of data across SDKs, or from SDK to Runner, is left as 
future work. Consequently, when Arrow data is processed in the SDK, it must be 
converted to/from Rows for transmission over the Fn API. The current ideal 
Python execution thus looks like:
   
   1. Read row-oriented data over the Fn API and deserialize it with SchemaCoder
   2. Buffer rows and construct an Arrow RecordBatch/Table object
   3. Perform user computation(s)
   4. Explode the output RecordBatch/Table into rows
   5. Serialize rows with SchemaCoder and write out over the Fn API
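   For illustration, steps (2) and (4) can be sketched with pyarrow directly. This is a hypothetical standalone sketch, not Beam's actual implementation: the SchemaCoder serialization and Fn API transport in steps (1) and (5) are omitted, and the row data is made up.

   ```python
   import pyarrow as pa

   # Rows as they might arrive after step (1): deserialized row-oriented data.
   rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

   # Step (2): buffer the rows and construct an Arrow RecordBatch.
   batch = pa.RecordBatch.from_pylist(rows)

   # Step (3): user computation(s) on the columnar batch would happen here.

   # Step (4): explode the RecordBatch back into row-oriented data,
   # ready for SchemaCoder serialization in step (5).
   out_rows = batch.to_pylist()
   assert out_rows == rows
   ```

   The round trip through `from_pylist`/`to_pylist` is exactly the per-element conversion overhead this issue proposes to optimize with a native implementation.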
   
   Note that (1,2) and (4,5) will exist in every stage of the user's pipeline, 
and they'll also exist when Python transforms (e.g. dataframe read_csv) are 
used in other SDKs. We should improve performance for this hot path by 
providing a native (Cythonized) implementation of (1,2) and (4,5).
   
   Imported from Jira 
[BEAM-14540](https://issues.apache.org/jira/browse/BEAM-14540). Original Jira 
may contain additional context.
   Reported by: bhulette.

