[ 
https://issues.apache.org/jira/browse/BEAM-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette updated BEAM-14540:
---------------------------------
    Description: 
With https://s.apache.org/batched-dofns (BEAM-14213), we want to encourage 
users to develop pipelines that process arrow data within the Python context, 
but communicating batches of data across SDKs or from SDK to Runner is left as 
future work. So when Arrow data is processed in the SDK, it must be converted 
to/from Rows for transmission over the Fn API. So the ideal Python execution 
looks like:

1. read row oriented data over the Fn API, deserialize  with SchemaCoder 
2. Buffer rows and construct an arrow RecordBatch/Table object
3. Perform user computation
4. Explode output RecordBatch/Table into rows
5. Serialize rows with SchemaCoder and write out over the Fn API

We can improve performance for this type of flow by making a native 
(cythonized) implementation for (1,2) and (4,5).

  was:
With https://s.apache.org/batched-dofns (BEAM-14213), we want to encourage 
users to develop pipelines that process arrow data within the Python context. 
But we don't propose any cross-SDK changes, so when Arrow data is processed in 
the SDK, it must be converted to/from Rows for transmission over the Fn API. So 
the ideal Python execution looks like:

1. read row oriented data over the Fn API, deserialize  with SchemaCoder 
2. Buffer rows and construct an arrow RecordBatch/Table object
3. Perform user computation
4. Explode output RecordBatch/Table into rows
5. Serialize rows with SchemaCoder and write out over the Fn API

We can improve performance for this type of flow by making a native 
(cythonized) implementation for (1,2) and (4,5).


> Native implementation for serialized Rows to/from Arrow
> -------------------------------------------------------
>
>                 Key: BEAM-14540
>                 URL: https://issues.apache.org/jira/browse/BEAM-14540
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Priority: P2
>
> With https://s.apache.org/batched-dofns (BEAM-14213), we want to encourage 
> users to develop pipelines that process arrow data within the Python context, 
> but communicating batches of data across SDKs or from SDK to Runner is left 
> as future work. So when Arrow data is processed in the SDK, it must be 
> converted to/from Rows for transmission over the Fn API. So the ideal Python 
> execution looks like:
> 1. read row oriented data over the Fn API, deserialize  with SchemaCoder 
> 2. Buffer rows and construct an arrow RecordBatch/Table object
> 3. Perform user computation
> 4. Explode output RecordBatch/Table into rows
> 5. Serialize rows with SchemaCoder and write out over the Fn API
> We can improve performance for this type of flow by making a native 
> (cythonized) implementation for (1,2) and (4,5).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to