Yicong-Huang commented on code in PR #53952:
URL: https://github.com/apache/spark/pull/53952#discussion_r2730172442
##########
python/pyspark/sql/conversion.py:
##########
@@ -63,17 +63,18 @@ class ArrowBatchTransformer:
"""
@staticmethod
- def flatten_struct(batch: "pa.RecordBatch") -> "pa.RecordBatch":
+ def flatten_struct(batch: "pa.RecordBatch", column_index: int = 0) -> "pa.RecordBatch":
"""
- Flatten a single struct column into a RecordBatch.
+ Flatten a struct column at given index into a RecordBatch.
Used by:
- ArrowStreamUDFSerializer.load_stream
- GroupArrowUDFSerializer.load_stream
+ - ArrowStreamArrowUDTFSerializer.load_stream
"""
import pyarrow as pa
- struct = batch.column(0)
+ struct = batch.column(column_index)
return pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))
Review Comment:
No, I'm only extracting the most common repeated patterns here. Code that is
used by a single code path is left inline. In total we will have 4-5
transformers for Arrow and 3-4 for pandas, and they can be reduced further
after cleanup, once we remove the unnecessary inheritance among serializers.
Yes, the logic is simple to understand on its own, but currently the same
logic is duplicated across multiple code paths, and the copies mix in
different logic. For example, some serializers combine flatten and to_pandas
on the same line, while others mix flatten with reordering. Abstracting these
transformers out makes the code easier to maintain and read, and reduces
duplicated code paths.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]