Brian Hulette created BEAM-14540:
------------------------------------

             Summary: Native implementation for serialized Rows to/from Arrow
                 Key: BEAM-14540
                 URL: https://issues.apache.org/jira/browse/BEAM-14540
             Project: Beam
          Issue Type: Improvement
          Components: sdk-py-core
            Reporter: Brian Hulette


With https://s.apache.org/batched-dofns (BEAM-14213), we want to encourage 
users to develop pipelines that process arrow data within the Python context. 
But we don't propose any cross-SDK changes, so when Arrow data is processed in 
the SDK, it must be converted to/from Rows for transmission over the Fn API. So 
the ideal Python execution looks like:

1. read row oriented data over the Fn API, deserialize  with SchemaCoder 
2. Buffer rows and construct an arrow RecordBatch/Table object
3. Perform user computation
4. Explode output RecordBatch/Table into rows
5. Serialize rows with SchemaCoder and write out over the Fn API

We can improve performance for this type of flow by making a native 
(cythonized) implementation for (1,2) and (4,5).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to