[PR] [SPARK-56937][PYTHON] Raise error on wrong column count in Arrow grouped/cogrouped map UDF (positional mode) [spark]

via GitHub Tue, 19 May 2026 01:02:55 -0700


Yicong-Huang opened a new pull request, #55978:
URL: https://github.com/apache/spark/pull/55978


   ### What changes were proposed in this pull request?
   
   In `verify_arrow_result` (`python/pyspark/worker.py`), the positional branch
   (used when
   `spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName=false`)
   uses `zip(expected_cols_and_types, actual_cols_and_types)` without first
   checking that both lists have the same length. Because `zip` stops at the
   shorter input, the extra entries of the longer list are silently dropped
   and the column-count mismatch is never reported.
   
   This PR adds an explicit length check before the zip and raises
   `RESULT_COLUMN_SCHEMA_MISMATCH` ("Number of columns of the returned data
   doesn't match specified schema. Expected: <N> Actual: <M>") on mismatch.
   The name-based branch already detects this case via the
   `RESULT_COLUMN_NAMES_MISMATCH` missing/extra check, so only the positional
   branch needs the fix.
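
The truncation and the added guard can be illustrated with a minimal plain-Python sketch. This is not the actual `verify_arrow_result` code: `check_positional` and `ColumnCountMismatch` are hypothetical names used only for illustration, and the real fix raises a PySpark error with the `RESULT_COLUMN_SCHEMA_MISMATCH` error class rather than a custom exception.

```python
from typing import List, Tuple

ColsAndTypes = List[Tuple[str, str]]


class ColumnCountMismatch(Exception):
    """Hypothetical stand-in for the RESULT_COLUMN_SCHEMA_MISMATCH error."""


def check_positional(expected: ColsAndTypes, actual: ColsAndTypes) -> list:
    """Compare columns by position; return the positions whose types differ."""
    # The guard this PR describes: compare lengths before zipping,
    # because zip() stops at the shorter input without complaint.
    if len(expected) != len(actual):
        raise ColumnCountMismatch(
            "Number of columns of the returned data doesn't match "
            f"specified schema. Expected: {len(expected)} Actual: {len(actual)}"
        )
    return [
        (pos, exp_type, act_type)
        for pos, ((_, exp_type), (_, act_type)) in enumerate(zip(expected, actual))
        if exp_type != act_type
    ]


expected = [("id", "int64"), ("v", "double")]
too_many = [("id", "int64"), ("v", "double"), ("extra", "string")]

# Without a length check, zip() pairs only the first two columns and the
# third returned column is never looked at.
pairs_without_guard = list(zip(expected, too_many))

try:
    check_positional(expected, too_many)
    message = None
except ColumnCountMismatch as exc:
    message = str(exc)
```

Here `pairs_without_guard` holds two pairs even though three columns were returned, which is the silent truncation described above, while `check_positional` rejects the same input with a message that carries both counts.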
   
   ### Why are the changes needed?
   
   The bug has been latent since SPARK-40559 (Dec 2023). When the user's
   Arrow grouped/cogrouped map UDF returns the wrong number of columns under
   positional column assignment, the symptom is either:
   * a silent truncation when the UDF returns extra columns (no error at
     all), or
   * a cryptic JVM-side `ArrayIndexOutOfBoundsException` when the UDF
     returns fewer columns than expected.
   
   The name-based branch raises a friendly `RESULT_COLUMN_NAMES_MISMATCH` in
   both cases; positional mode should be symmetric.
   
   Affected eval types: `SQL_GROUPED_MAP_ARROW_UDF`,
   `SQL_GROUPED_MAP_ARROW_ITER_UDF`, `SQL_COGROUPED_MAP_ARROW_UDF`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. UDFs that previously returned the wrong number of columns under
   positional mode now fail with a clear `RESULT_COLUMN_SCHEMA_MISMATCH`
   error from Python instead of returning truncated data or a JVM
   `ArrayIndexOutOfBoundsException`.
   
   ### How was this patch tested?
   
   Added new tests:
   * `test_apply_in_arrow_returning_wrong_column_count_positional_assignment`
     in `test_arrow_grouped_map.py` (covers both regular and iterator
     variants via `function_variations`).
   * `test_apply_in_arrow_returning_wrong_column_count_positional_assignment`
     in `test_arrow_cogrouped_map.py`.
   
   Each test exercises both the too-many-columns and too-few-columns cases
   under `assignColumnsByName=false` and asserts the
   `RESULT_COLUMN_SCHEMA_MISMATCH` message contains the expected vs. actual
   counts.
   
   The full `test_arrow_grouped_map` and `test_arrow_cogrouped_map` suites
   also pass with the fix.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.



