[ 
https://issues.apache.org/jira/browse/SPARK-56222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56222:
-----------------------------------
    Labels: pull-request-available  (was: )

> Split ArrowStreamSerializer into GroupSerializer and CoGroupSerializer for 
> better type hints
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-56222
>                 URL: https://issues.apache.org/jira/browse/SPARK-56222
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major
>              Labels: pull-request-available
>
> Currently ArrowStreamSerializer uses num_dfs to switch between plain stream 
> (num_dfs=0), grouped (num_dfs=1), and cogrouped (num_dfs=2) modes. This makes 
> load_stream return different types depending on runtime state:
> - num_dfs=0: Iterator[pa.RecordBatch]
> - num_dfs=1: Iterator[Iterator[pa.RecordBatch]]
> - num_dfs=2: Iterator[Tuple[Iterator[pa.RecordBatch], Iterator[pa.RecordBatch]]]
> This prevents proper type annotations on both the serializer and the func 
> closures in read_udfs().
> Proposal: split into dedicated GroupSerializer and CoGroupSerializer classes 
> so each has a single, well-typed load_stream signature.
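
For illustration only, the typing problem and the proposed split might look
roughly like the sketch below. It is not the code from the pull request. The
class names GroupSerializer and CoGroupSerializer, the num_dfs switch and
load_stream come from the issue text; everything else is an assumption made so
the sketch runs on its own: Batch is a stand-in for pa.RecordBatch, the input
is already-decoded Python lists rather than an Arrow byte stream, and
ModeSwitchingSerializer is a hypothetical name for the current behaviour.

```python
from typing import Iterable, Iterator, List, Tuple, Union

# Stand-in for pa.RecordBatch so the sketch runs without pyarrow (assumption).
Batch = List[int]


class ModeSwitchingSerializer:
    """Hypothetical model of the current shape: one class, three return types."""

    def __init__(self, num_dfs: int = 0) -> None:
        self.num_dfs = num_dfs

    def load_stream(self, stream) -> Union[
        Iterator[Batch],
        Iterator[Iterator[Batch]],
        Iterator[Tuple[Iterator[Batch], Iterator[Batch]]],
    ]:
        # The return type depends on a runtime value, so callers cannot be
        # annotated precisely without casts.
        if self.num_dfs == 0:
            return iter(stream)
        if self.num_dfs == 1:
            return (iter(group) for group in stream)
        if self.num_dfs == 2:
            return ((iter(left), iter(right)) for left, right in stream)
        raise ValueError(f"unsupported num_dfs: {self.num_dfs}")


class GroupSerializer:
    """Sketch of the grouped case: always yields one iterator per group."""

    def load_stream(self, stream: Iterable[List[Batch]]) -> Iterator[Iterator[Batch]]:
        for group in stream:
            yield iter(group)


class CoGroupSerializer:
    """Sketch of the cogrouped case: always yields a pair of iterators."""

    def load_stream(
        self, stream: Iterable[Tuple[List[Batch], List[Batch]]]
    ) -> Iterator[Tuple[Iterator[Batch], Iterator[Batch]]]:
        for left, right in stream:
            yield iter(left), iter(right)
```

With one class per mode, a type checker can verify each caller against a
single load_stream signature instead of a three-way Union.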



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
