Re: [PR] [SPARK-55169][PYTHON] Use ArrowBatchTransformer.flatten_struct in ArrowStreamArrowUDTFSerializer [spark]

via GitHub Mon, 26 Jan 2026 20:22:57 -0800


Yicong-Huang commented on code in PR #53952:
URL: https://github.com/apache/spark/pull/53952#discussion_r2730172442



##########
python/pyspark/sql/conversion.py:
##########
@@ -63,17 +63,18 @@ class ArrowBatchTransformer:
     """
 
     @staticmethod
-    def flatten_struct(batch: "pa.RecordBatch") -> "pa.RecordBatch":
+    def flatten_struct(batch: "pa.RecordBatch", column_index: int = 0) -> "pa.RecordBatch":
         """
-        Flatten a single struct column into a RecordBatch.
+        Flatten a struct column at given index into a RecordBatch.
 
         Used by:
             - ArrowStreamUDFSerializer.load_stream
             - GroupArrowUDFSerializer.load_stream
+            - ArrowStreamArrowUDTFSerializer.load_stream
         """
         import pyarrow as pa
 
-        struct = batch.column(0)
+        struct = batch.column(column_index)
         return pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))

Review Comment:
   No, I'm only extracting the most commonly repeated patterns here. Code that
is used by a single code path I don't extract. In total we will have
4-5 transformers for Arrow and 3-4 for pandas, and that number can be reduced
once we clean up the unnecessary inheritance among serializers.
   
   Yes, the logic is simple to understand by itself, but currently the same
logic is duplicated across multiple code paths, in different versions that are
mixed with other logic. For example, some serializers mix flatten and
to_pandas in the same line; others mix flatten with reorder. It is unclear to me
on the surface how many serializers flatten the struct and which serializers
reorder columns.
   
   Abstracting these transformers out makes the code easier to maintain and
read, and reduces duplicate code paths.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


