Yicong-Huang commented on code in PR #53952:
URL: https://github.com/apache/spark/pull/53952#discussion_r2730043004
##########
python/pyspark/sql/conversion.py:
##########
@@ -63,17 +63,18 @@ class ArrowBatchTransformer:
     """
 
     @staticmethod
-    def flatten_struct(batch: "pa.RecordBatch") -> "pa.RecordBatch":
+    def flatten_struct(batch: "pa.RecordBatch", column_index: int = 0) -> "pa.RecordBatch":
         """
-        Flatten a single struct column into a RecordBatch.
+        Flatten a struct column at given index into a RecordBatch.
 
         Used by:
         - ArrowStreamUDFSerializer.load_stream
         - GroupArrowUDFSerializer.load_stream
+        - ArrowStreamArrowUDTFSerializer.load_stream
         """
         import pyarrow as pa
 
-        struct = batch.column(0)
+        struct = batch.column(column_index)
         return pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))
Review Comment:
I don’t think this is over-engineering. We currently have similar logic duplicated across multiple serializers, and this consolidates those patterns into one discoverable place. These are still simple, pure functions that can be tested independently and composed for readability, so the abstraction is mainly about organization and reuse rather than adding real complexity.

If we later find that this layer isn’t pulling its weight, we can always inline the functions back into the serializers after the cleanup. For now, having a single place to centralize and deduplicate this logic makes the codebase easier to reason about. WDYT?
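
For concreteness, here is a minimal sketch of what the helper boils down to. This is plain pyarrow with made-up toy data, not actual PySpark test code; `struct_arr`, `batch`, and `flat` are just local names for illustration:

```python
import pyarrow as pa

# A one-column batch whose only column is a struct<a: int64, b: string>.
struct_arr = pa.array([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
batch = pa.RecordBatch.from_arrays([struct_arr], names=["s"])

# What flatten_struct(batch, column_index=0) does: unpack the struct's
# child arrays into top-level columns, keeping the struct's field names.
struct = batch.column(0)
flat = pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))

print(flat.schema)  # a: int64, b: string
```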
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]