Yicong-Huang commented on code in PR #53952:
URL: https://github.com/apache/spark/pull/53952#discussion_r2730172442
##########
python/pyspark/sql/conversion.py:
##########
@@ -63,17 +63,18 @@ class ArrowBatchTransformer:
"""
@staticmethod
- def flatten_struct(batch: "pa.RecordBatch") -> "pa.RecordBatch":
+ def flatten_struct(batch: "pa.RecordBatch", column_index: int = 0) -> "pa.RecordBatch":
"""
- Flatten a single struct column into a RecordBatch.
+ Flatten a struct column at given index into a RecordBatch.
Used by:
- ArrowStreamUDFSerializer.load_stream
- GroupArrowUDFSerializer.load_stream
+ - ArrowStreamArrowUDTFSerializer.load_stream
"""
import pyarrow as pa
- struct = batch.column(0)
+ struct = batch.column(column_index)
return pa.RecordBatch.from_arrays(struct.flatten(), schema=pa.schema(struct.type))
Review Comment:
No, I'm only extracting the most common repeated patterns here. Code that is
used by a single code path is left inline. In total we will have 4-5
transformers for Arrow and 3-4 for pandas, and they can be reduced further
after cleanup, once we remove the unnecessary inheritance among serializers.
Yes, the logic is simple to understand on its own, but currently the same
logic is duplicated across multiple code paths, and the copies mix in
different logic. For example, some serializers combine flatten and to_pandas
on the same line, while others mix flatten with reordering. Abstracting these
transformers out makes the code easier to maintain and read, and reduces
duplicated code paths.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]