Yicong-Huang opened a new pull request, #56021: URL: https://github.com/apache/spark/pull/56021
### What changes were proposed in this pull request?

This PR consolidates the two parallel UDF-result-verification paths in `worker.py` so they share structure:

- Extract a new `_verify_column_schema(actual_names, expected_names, *, assign_cols_by_name)` helper that raises `RESULT_COLUMN_NAMES_MISMATCH` (by-name mode) or `RESULT_COLUMN_SCHEMA_MISMATCH` (by-position mode).
- Rewrite `verify_pandas_result` to delegate the container-type check to `verify_return_type` (from SPARK-56612) and the name/count check to the new helper. The pandas-specific fallback (RangeIndex / non-string columns -> by-position count check) is preserved by computing the effective `assign_cols_by_name` at the call site.
- Rewrite `verify_arrow_result` to delegate the name/count check to the same helper, keeping only the arrow-specific column-type check (`RESULT_COLUMN_TYPES_MISMATCH`) inline.
- Fix `verify_return_type`'s package-name derivation so it uses the top-level package (`pandas` rather than `pandas.core` for `pd.DataFrame`), keeping the existing `UDF_RETURN_TYPE` message format intact for both pandas and pyarrow containers.

Net diff: `python/pyspark/worker.py | 94 ++++++++++++++++++++++++++++--------------- 116 deletions`.

### Why are the changes needed?

Before this PR, the pandas and arrow verify functions duplicated their name/count validation with subtly different code shapes, even though they now raise the same set of error classes (after SPARK-56937 added the missing column-count check to the arrow path). Sharing a single helper:

1. Guarantees consistent error messages and structure across both paths.
2. Removes the temptation for the two paths to drift again as more pandas eval types get refactored (SPARK-55388 work).
3. Makes `verify_pandas_result` a thin shell over `verify_return_type` + shared schema check, mirroring the arrow callsites and shrinking the file by ~22 net lines.

No data-conversion change: this is a verification-side cleanup.
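
For readers who have not opened the diff, the shape of the shared helper described above can be sketched roughly as follows. This is an illustration only, not the code in `worker.py`: `ResultVerificationError`, `verify_column_schema_sketch`, and `effective_assign_by_name` are hypothetical names, the message text is invented, and the rule "use by-name mode only if at least one column label is a string" is an assumption about how the pandas fallback is computed at the call site.

```python
from typing import Sequence


class ResultVerificationError(ValueError):
    """Hypothetical stand-in for PySpark's error type; carries an error class name."""

    def __init__(self, error_class: str, detail: str) -> None:
        super().__init__(f"[{error_class}] {detail}")
        self.error_class = error_class


def verify_column_schema_sketch(
    actual_names: Sequence[object],
    expected_names: Sequence[str],
    *,
    assign_cols_by_name: bool,
) -> None:
    """Check result columns against the expected schema, by name or by position."""
    if assign_cols_by_name:
        # By-name mode: the sets of names must match; order is irrelevant.
        actual, expected = set(actual_names), set(expected_names)
        if actual != expected:
            missing = sorted(expected - actual)
            extra = sorted(actual - expected, key=str)
            raise ResultVerificationError(
                "RESULT_COLUMN_NAMES_MISMATCH",
                f"missing={missing}, extra={extra}",
            )
    elif len(actual_names) != len(expected_names):
        # By-position mode: only the column count is compared.
        raise ResultVerificationError(
            "RESULT_COLUMN_SCHEMA_MISMATCH",
            f"expected {len(expected_names)} columns, got {len(actual_names)}",
        )


def effective_assign_by_name(
    column_labels: Sequence[object], assign_cols_by_name: bool
) -> bool:
    """Assumed pandas-side fallback: integer-only labels force by-position mode."""
    return assign_cols_by_name and any(isinstance(c, str) for c in column_labels)
```

Under these assumptions, a pandas call site would pass `effective_assign_by_name(list(result.columns), assign_cols_by_name)` as the keyword argument, while an arrow call site would pass `assign_cols_by_name` through unchanged and keep its column-type check separate.
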
The pandas path still does **not** do an arrow-type comparison (that responsibility remains with `PandasToArrowConversion.convert` downstream).

### Does this PR introduce _any_ user-facing change?

No. The same error classes (`UDF_RETURN_TYPE`, `RESULT_COLUMN_NAMES_MISMATCH`, `RESULT_COLUMN_SCHEMA_MISMATCH`, `RESULT_COLUMN_TYPES_MISMATCH`) are raised under the same conditions with the same `messageParameters` payloads.

### How was this patch tested?

Existing tests. No behavior change. Verified locally:

- `pyspark.sql.tests.pandas.test_pandas_map` (28 passed) - covers `RESULT_COLUMN_NAMES_MISMATCH` / `RESULT_COLUMN_SCHEMA_MISMATCH` for `wrap_pandas_batch_iter_udf` and the `UDF_RETURN_TYPE` path.
- `pyspark.sql.tests.pandas.test_pandas_grouped_map` + `test_pandas_cogrouped_map` (85 passed) - covers `SQL_GROUPED_MAP_PANDAS_(ITER_)UDF` and `wrap_cogrouped_map_pandas_udf`, including the `pandas.DataFrame` container-type error message.
- `pyspark.sql.tests.arrow.test_arrow_grouped_map` + `test_arrow_cogrouped_map` (41 passed) - covers `verify_arrow_result` after the helper refactor, including `RESULT_COLUMN_TYPES_MISMATCH`.
- `pyspark.sql.tests.test_udtf` (legacy arrow UDTF subset) - covers `read_udtf` callsite at the legacy pandas UDTF path.

### Was this patch authored or co-authored using generative AI tooling?

No.
