Yicong Huang created SPARK-55197:
------------------------------------

             Summary: Extract _insert_stream_start helper to deduplicate 
START_ARROW_STREAM signal logic
                 Key: SPARK-55197
                 URL: https://issues.apache.org/jira/browse/SPARK-55197
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


Multiple Arrow serializers repeat the same pattern for writing 
{{START_ARROW_STREAM}} before dumping batches:

{code:python}
first = next(iterator, None)
if first is None:
    return
write_int(SpecialLengths.START_ARROW_STREAM, stream)
# then chain first with rest...
{code}

This pattern appears in {{ArrowStreamUDFSerializer}}, 
{{ArrowStreamPandasUDFSerializer}}, {{ArrowStreamArrowUDFSerializer}}, and 
{{ApplyInPandasWithStateSerializer}}.

Proposal: Extract a {{_insert_stream_start}} helper in 
{{ArrowStreamSerializer}} to centralize this logic.




