Yicong-Huang commented on code in PR #53992:
URL: https://github.com/apache/spark/pull/53992#discussion_r2779963075


##########
python/pyspark/worker.py:
##########
@@ -2847,10 +2788,19 @@ def read_udfs(pickleSer, infile, eval_type, runner_conf):
                 or eval_type == PythonEvalType.SQL_MAP_PANDAS_ITER_UDF
             )
             # Arrow-optimized Python UDF takes a struct type argument as a Row
+            # When legacy pandas conversion is enabled, use "row" and convert ndarray to list
             struct_in_pandas = (
-                "row" if eval_type == PythonEvalType.SQL_ARROW_BATCHED_UDF else "dict"
+                "row"
+                if (
+                    eval_type == PythonEvalType.SQL_ARROW_BATCHED_UDF
+                    or runner_conf.use_legacy_pandas_udf_conversion

Review Comment:
   Good catch! You're right: the `or runner_conf.use_legacy_pandas_udf_conversion` clause is redundant here. When `use_legacy_pandas_udf_conversion=True` and `eval_type=SQL_ARROW_BATCHED_UDF`, the earlier elif condition (line 2759) doesn't match, so execution falls through to the else branch anyway, and the `or` clause is unnecessary for that case.
   And as you pointed out, it also has an unintended side effect: if `use_legacy_pandas_udf_conversion=True` while `eval_type` is something else (e.g., `SQL_SCALAR_PANDAS_UDF`), it would incorrectly change `struct_in_pandas` from "dict" to "row".
   I've created a follow-up PR to fix this: #54212
   I've created a follow-up PR to fix this: #54212
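   To make the side effect concrete, here is a minimal standalone sketch of the two conditionals (not the actual worker.py code; the eval-type constants are stand-in values defined locally rather than the real `PythonEvalType` ones):

```python
# Hypothetical stand-ins for PythonEvalType constants, just for illustration.
SQL_ARROW_BATCHED_UDF = 101
SQL_SCALAR_PANDAS_UDF = 200


def struct_in_pandas_buggy(eval_type, use_legacy_pandas_udf_conversion):
    # Version from the diff: the extra `or` clause flips "dict" to "row"
    # for *any* eval_type whenever the legacy conversion flag is on.
    return (
        "row"
        if (
            eval_type == SQL_ARROW_BATCHED_UDF
            or use_legacy_pandas_udf_conversion
        )
        else "dict"
    )


def struct_in_pandas_fixed(eval_type, use_legacy_pandas_udf_conversion):
    # Fixed version: only the Arrow-optimized Python UDF uses "row";
    # the legacy flag no longer affects the choice here.
    return "row" if eval_type == SQL_ARROW_BATCHED_UDF else "dict"


# Unintended side effect: a pandas scalar UDF with the legacy flag enabled
print(struct_in_pandas_buggy(SQL_SCALAR_PANDAS_UDF, True))  # "row" (wrong)
print(struct_in_pandas_fixed(SQL_SCALAR_PANDAS_UDF, True))  # "dict" (correct)
```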



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

