[PR] [SPARK-55336][PYTHON] Let createDF use create_batch logic for decoupling [spark]

via GitHub Mon, 02 Feb 2026 23:19:21 -0800


Yicong-Huang opened a new pull request, #54111:
URL: https://github.com/apache/spark/pull/54111


   ### What changes were proposed in this pull request?
   
   This PR duplicates the pandas-to-Arrow batch conversion logic of `ArrowStreamPandasSerializer` as standalone functions, so that callers no longer need to depend on the serializer class:
   
   - `create_arrow_array_from_pandas()` - converts a pandas Series to an Arrow Array
   - `create_arrow_batch_from_pandas()` - converts a list of (series, spark_type) tuples to an Arrow RecordBatch
   
   Both `_create_from_pandas_with_arrow` (classic Spark) and `createDataFrame` 
(Spark Connect) now use these standalone functions directly with 
`ArrowStreamSerializer`, instead of depending on `ArrowStreamPandasSerializer`.
   
   ### Why are the changes needed?
   
   For better decoupling. Previously, `createDataFrame` had to instantiate 
`ArrowStreamPandasSerializer` just to call its `_create_batch` method. 
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


