[PR] [SPARK-56973][PYTHON] Consolidate verify_pandas_result with verify_arrow_result via shared helper [spark]

via GitHub Wed, 20 May 2026 15:28:47 -0700


Yicong-Huang opened a new pull request, #56021:
URL: https://github.com/apache/spark/pull/56021


   ### What changes were proposed in this pull request?
   
   This PR consolidates the two parallel UDF-result-verification paths in 
`worker.py` so they share structure:
   
   - Extract a new `_verify_column_schema(actual_names, expected_names, *, 
assign_cols_by_name)` helper that raises `RESULT_COLUMN_NAMES_MISMATCH` 
(by-name mode) or `RESULT_COLUMN_SCHEMA_MISMATCH` (by-position mode).
   - Rewrite `verify_pandas_result` to delegate the container-type check to 
`verify_return_type` (from SPARK-56612) and the name/count check to the new 
helper. The pandas-specific fallback (RangeIndex / non-string columns -> 
by-position count check) is preserved by computing the effective 
`assign_cols_by_name` at the call site.
   - Rewrite `verify_arrow_result` to delegate the name/count check to the same 
helper, keeping only the arrow-specific column-type check 
(`RESULT_COLUMN_TYPES_MISMATCH`) inline.
   - Fix `verify_return_type`'s package-name derivation so it uses the 
top-level package (`pandas` rather than `pandas.core` for `pd.DataFrame`), 
keeping the existing `UDF_RETURN_TYPE` message format intact for both pandas 
and pyarrow containers.
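
For illustration only, the package-name derivation in the last bullet can be sketched as below. This is not the code from `worker.py`: the helper names `top_level_package` and `qualified_name` are hypothetical, and `FakeDataFrame` is a stand-in whose module path (`pandas.core.frame`) is an assumption, used so the sketch runs without pandas installed.

```python
def top_level_package(tp: type) -> str:
    """Return the first component of the module path that defines `tp`."""
    return tp.__module__.split(".", 1)[0]


def qualified_name(tp: type) -> str:
    """Build a '<top-level package>.<class name>' label for error messages."""
    return f"{top_level_package(tp)}.{tp.__name__}"


# Hypothetical stand-in for pandas.DataFrame; the module path is an assumption.
FakeDataFrame = type("DataFrame", (), {"__module__": "pandas.core.frame"})

# Dropping only the last module component would keep the intermediate package,
# which is one way a 'pandas.core' prefix could arise.
parent_package = FakeDataFrame.__module__.rsplit(".", 1)[0]

print(parent_package)                 # intermediate package path
print(qualified_name(FakeDataFrame))  # label built from the top-level package
```

Taking the first dotted component keeps the label stable even if a class moves between internal submodules.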
   
   Net diff: `python/pyspark/worker.py`, 94 insertions and 116 deletions.
   
   ### Why are the changes needed?
   
   Before this PR, the pandas and arrow verify functions duplicated their 
name/count validation with subtly different code shapes, even though they now 
raise the same set of error classes (after SPARK-56937 added the missing 
column-count check to the arrow path). Sharing a single helper:
   
   1. Guarantees consistent error messages and structure across both paths.
   2. Reduces the risk of the two paths drifting apart again as more pandas 
eval types are refactored (SPARK-55388).
   3. Makes `verify_pandas_result` a thin wrapper around `verify_return_type` 
plus the shared schema check, mirroring the arrow callsites and shrinking the 
file by ~22 net lines.
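
As a rough sketch of the shared-helper shape (not the actual `worker.py` code), the by-name and by-position checks and the pandas-side fallback could look like the following. `SchemaMismatch` is a hypothetical stand-in for the PySpark error type, the function names carry a `_sketch` suffix to mark them as illustrative, and the parameter keys (`missing`, `extra`, `expected`, `actual`) are assumptions rather than the real `messageParameters`.

```python
class SchemaMismatch(Exception):
    """Hypothetical stand-in for the PySpark error raised by the real helper."""

    def __init__(self, error_class, params):
        super().__init__(f"[{error_class}] {params}")
        self.error_class = error_class
        self.params = params


def verify_column_schema_sketch(actual_names, expected_names, *, assign_cols_by_name):
    """Check names (by-name mode) or only the column count (by-position mode)."""
    if assign_cols_by_name:
        actual, expected = set(actual_names), set(expected_names)
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        if missing or extra:
            raise SchemaMismatch(
                "RESULT_COLUMN_NAMES_MISMATCH", {"missing": missing, "extra": extra}
            )
    elif len(actual_names) != len(expected_names):
        raise SchemaMismatch(
            "RESULT_COLUMN_SCHEMA_MISMATCH",
            {"expected": len(expected_names), "actual": len(actual_names)},
        )


def effective_assign_by_name_sketch(column_labels, assign_cols_by_name):
    """Fall back to by-position when any label is not a string (e.g. 0, 1, 2)."""
    return assign_cols_by_name and all(isinstance(c, str) for c in column_labels)


# Pandas-style call site: integer labels force the by-position count check.
labels = [0, 1]
by_name = effective_assign_by_name_sketch(labels, True)
verify_column_schema_sketch(labels, ["id", "value"], assign_cols_by_name=by_name)
```

Computing the effective flag at the call site keeps the helper itself free of pandas-specific logic, so an arrow caller can pass its flag through unchanged.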
   
   No data-conversion change: this is a verification-side cleanup. The pandas 
path still does **not** do an arrow-type comparison (that responsibility 
remains with `PandasToArrowConversion.convert` downstream).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The same error classes (`UDF_RETURN_TYPE`, 
`RESULT_COLUMN_NAMES_MISMATCH`, `RESULT_COLUMN_SCHEMA_MISMATCH`, 
`RESULT_COLUMN_TYPES_MISMATCH`) are raised under the same conditions with the 
same `messageParameters` payloads.
   
   ### How was this patch tested?
   
   Existing tests. No behavior change. Verified locally:
   
   - `pyspark.sql.tests.pandas.test_pandas_map` (28 passed) - covers 
`RESULT_COLUMN_NAMES_MISMATCH` / `RESULT_COLUMN_SCHEMA_MISMATCH` for 
`wrap_pandas_batch_iter_udf` and the `UDF_RETURN_TYPE` path.
   - `pyspark.sql.tests.pandas.test_pandas_grouped_map` + 
`test_pandas_cogrouped_map` (85 passed) - covers 
`SQL_GROUPED_MAP_PANDAS_(ITER_)UDF` and `wrap_cogrouped_map_pandas_udf`, 
including the `pandas.DataFrame` container-type error message.
   - `pyspark.sql.tests.arrow.test_arrow_grouped_map` + 
`test_arrow_cogrouped_map` (41 passed) - covers `verify_arrow_result` after the 
helper refactor, including `RESULT_COLUMN_TYPES_MISMATCH`.
   - `pyspark.sql.tests.test_udtf` (legacy arrow UDTF subset) - covers 
the `read_udtf` callsite on the legacy pandas UDTF path.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


