Re: [PR] [SPARK-55224][PYTHON] Use Spark DataType as ground truth in Pandas-Arrow serialization [spark]

via GitHub Sat, 07 Feb 2026 04:57:58 -0800


zhengruifeng commented on code in PR #53992:
URL: https://github.com/apache/spark/pull/53992#discussion_r2777515521



##########
python/pyspark/worker.py:
##########
@@ -2847,10 +2788,19 @@ def read_udfs(pickleSer, infile, eval_type, runner_conf):
                 or eval_type == PythonEvalType.SQL_MAP_PANDAS_ITER_UDF
             )
             # Arrow-optimized Python UDF takes a struct type argument as a Row
+            # When legacy pandas conversion is enabled, use "row" and convert ndarray to list
             struct_in_pandas = (
-                "row" if eval_type == PythonEvalType.SQL_ARROW_BATCHED_UDF else "dict"
+                "row"
+                if (
+                    eval_type == PythonEvalType.SQL_ARROW_BATCHED_UDF
+                    or runner_conf.use_legacy_pandas_udf_conversion

Review Comment:
   @Yicong-Huang why was `or runner_conf.use_legacy_pandas_udf_conversion` added here?
   
   `use_legacy_pandas_udf_conversion` is supposed to take effect only in `SQL_ARROW_BATCHED_UDF`.
   
   Suppose the eval type is `SQL_SCALAR_PANDAS_UDF` and the config `use_legacy_pandas_udf_conversion` is true.
   
   Doesn't that change `struct_in_pandas` from `dict` to `row`?
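
   To make the concern concrete, here is a minimal sketch comparing the old and new conditions. `EvalType` and `RunnerConf` are hypothetical stand-ins for illustration, not the real `PythonEvalType` or runner conf object in `worker.py`; only the two conditional expressions are taken from the diff.

```python
from dataclasses import dataclass
from enum import Enum, auto


class EvalType(Enum):
    # Hypothetical stand-in for PythonEvalType; the real constants are integers.
    SQL_ARROW_BATCHED_UDF = auto()
    SQL_SCALAR_PANDAS_UDF = auto()


@dataclass
class RunnerConf:
    # Hypothetical stand-in exposing only the flag discussed in this review.
    use_legacy_pandas_udf_conversion: bool = False


def struct_in_pandas_before(eval_type: EvalType, conf: RunnerConf) -> str:
    # Condition on the removed (-) line: depends on the eval type only.
    return "row" if eval_type == EvalType.SQL_ARROW_BATCHED_UDF else "dict"


def struct_in_pandas_in_pr(eval_type: EvalType, conf: RunnerConf) -> str:
    # Condition on the added (+) lines: the legacy flag alone also selects "row".
    return (
        "row"
        if (
            eval_type == EvalType.SQL_ARROW_BATCHED_UDF
            or conf.use_legacy_pandas_udf_conversion
        )
        else "dict"
    )


if __name__ == "__main__":
    legacy = RunnerConf(use_legacy_pandas_udf_conversion=True)
    for eval_type in EvalType:
        print(
            eval_type.name,
            struct_in_pandas_before(eval_type, legacy),
            struct_in_pandas_in_pr(eval_type, legacy),
        )
```

   Under these assumptions, the two functions agree for `SQL_ARROW_BATCHED_UDF` but differ for `SQL_SCALAR_PANDAS_UDF` whenever the legacy flag is on, which is the case asked about above.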



##########
python/pyspark/worker.py:
##########
@@ -2847,10 +2788,19 @@ def read_udfs(pickleSer, infile, eval_type, runner_conf):
                 or eval_type == PythonEvalType.SQL_MAP_PANDAS_ITER_UDF
             )
             # Arrow-optimized Python UDF takes a struct type argument as a Row
+            # When legacy pandas conversion is enabled, use "row" and convert ndarray to list
             struct_in_pandas = (
-                "row" if eval_type == PythonEvalType.SQL_ARROW_BATCHED_UDF else "dict"
+                "row"
+                if (
+                    eval_type == PythonEvalType.SQL_ARROW_BATCHED_UDF
+                    or runner_conf.use_legacy_pandas_udf_conversion
+                )
+                else "dict"
+            )
+            ndarray_as_list = (
+                eval_type == PythonEvalType.SQL_ARROW_BATCHED_UDF
+                or runner_conf.use_legacy_pandas_udf_conversion

Review Comment:
   ditto?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

