BryanCutler opened a new pull request #24095: [SPARK-27163][PYTHON] Cleanup and 
consolidate Pandas UDF functionality
URL: https://github.com/apache/spark/pull/24095
 
 
   ## What changes were proposed in this pull request?
   
   This change is a cleanup and consolidation of 3 areas related to Pandas UDFs:
   
   1) `ArrowStreamPandasSerializer` now inherits from `ArrowStreamSerializer` 
and uses the base class's `dump_stream` and `load_stream` to create the Arrow 
reader/writer and send Arrow record batches. `ArrowStreamPandasSerializer` 
handles the conversions to/from Pandas and produces Arrow record batch 
iterators. This removes the duplicated creation of Arrow readers/writers.
   
   2) `createDataFrame` with Arrow now uses `ArrowStreamPandasSerializer` 
instead of doing its own conversions from Pandas to Arrow and sending record 
batches through `ArrowStreamSerializer`.
   
   3) Grouped Map UDFs now reuse existing logic in 
`ArrowStreamPandasSerializer` to send Pandas DataFrame results as a 
`StructType` instead of separating each column from the DataFrame. This makes 
the code a little more consistent with the Python worker, but it does require 
the returned `StructType` column to be flattened in `FlatMapGroupsInPandasExec` 
in Scala.
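
   The layering described in 1) can be sketched roughly as follows. This is 
not the PySpark code: `FramedStreamSerializer` and `RecordStreamSerializer` 
are hypothetical stand-ins for `ArrowStreamSerializer` and 
`ArrowStreamPandasSerializer`, and JSON payloads stand in for Arrow record 
batches so the example needs only the standard library. It is meant only to 
show the base class owning the stream reader/writer while the subclass does 
the conversion and delegates.

```python
import io
import json
import struct


class FramedStreamSerializer:
    """Hypothetical base class: owns how payloads are written to/read from a stream."""

    def dump_stream(self, iterator, stream):
        # Write each payload as a 4-byte big-endian length followed by its bytes.
        for payload in iterator:
            stream.write(struct.pack(">i", len(payload)))
            stream.write(payload)

    def load_stream(self, stream):
        # Yield payloads until the stream has no complete length header left.
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return
            (length,) = struct.unpack(">i", header)
            yield stream.read(length)


class RecordStreamSerializer(FramedStreamSerializer):
    """Hypothetical subclass: converts records, then reuses the base stream logic."""

    def dump_stream(self, iterator, stream):
        # Convert each record to bytes lazily, then let the base class write them.
        payloads = (json.dumps(rec, sort_keys=True).encode("utf-8") for rec in iterator)
        super().dump_stream(payloads, stream)

    def load_stream(self, stream):
        # Let the base class read the frames, then convert each one back.
        for payload in super().load_stream(stream):
            yield json.loads(payload.decode("utf-8"))


records = [{"id": 1, "v": 2.0}, {"id": 2, "v": 3.5}]
buf = io.BytesIO()
RecordStreamSerializer().dump_stream(iter(records), buf)
buf.seek(0)
round_tripped = list(RecordStreamSerializer().load_stream(buf))
```

   With this split, the subclass never touches the stream framing directly, 
so there is a single place where readers and writers are created.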
   
   ## How was this patch tested?
   
   Existing tests; also ran the tests with pyarrow 0.12.0.

