zhengruifeng commented on code in PR #53952:
URL: https://github.com/apache/spark/pull/53952#discussion_r2730143403
##########
python/pyspark/sql/conversion.py:
##########
@@ -63,17 +63,18 @@ class ArrowBatchTransformer:
     """
 
     @staticmethod
-    def flatten_struct(batch: "pa.RecordBatch") -> "pa.RecordBatch":
+    def flatten_struct(batch: "pa.RecordBatch", column_index: int = 0) -> "pa.RecordBatch":
         """
-        Flatten a single struct column into a RecordBatch.
+        Flatten a struct column at given index into a RecordBatch.
 
         Used by:
         - ArrowStreamUDFSerializer.load_stream
         - GroupArrowUDFSerializer.load_stream
+        - ArrowStreamArrowUDTFSerializer.load_stream
         """
         import pyarrow as pa
 
-        struct = batch.column(0)
+        struct = batch.column(column_index)
         return pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))
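For context, a minimal self-contained sketch of what this helper amounts to; the toy data and the column name `s` are illustrative, not from the PR:

```python
import pyarrow as pa

# A RecordBatch whose column 0 is a struct, as the serializers produce.
struct_arr = pa.array(
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
    type=pa.struct([("a", pa.int64()), ("b", pa.string())]),
)
batch = pa.RecordBatch.from_arrays([struct_arr], names=["s"])

# flatten_struct promotes the struct's children to top-level columns.
struct = batch.column(0)
flattened = pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))
print(flattened.schema)  # a: int64, b: string
```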
Review Comment:
These helpers are simple, only one or two lines each; we can add tests for them
in `pyspark.tests.upstream` instead.
The original code is simple enough to understand, and with this wrapper,
developers have to switch to another file to check the function definition.
Are we going to add all of the pandas/pyarrow functions we use here?
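For illustration, a minimal unittest sketch of the kind of test suggested above, assuming the helper stays at `ArrowBatchTransformer.flatten_struct` as in the diff; the test class and method names here are hypothetical:

```python
import unittest

import pyarrow as pa

from pyspark.sql.conversion import ArrowBatchTransformer


class FlattenStructTests(unittest.TestCase):  # hypothetical test class
    def test_flatten_struct_at_index(self):
        # Build a one-column batch whose column 0 is a struct.
        struct_arr = pa.array(
            [{"a": 1, "b": "x"}],
            type=pa.struct([("a", pa.int64()), ("b", pa.string())]),
        )
        batch = pa.RecordBatch.from_arrays([struct_arr], names=["s"])

        flattened = ArrowBatchTransformer.flatten_struct(batch, column_index=0)

        # The struct's fields should become the top-level schema.
        self.assertEqual(
            flattened.schema, pa.schema([("a", pa.int64()), ("b", pa.string())])
        )
        self.assertEqual(flattened.num_rows, 1)
```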
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]