Yicong Huang created SPARK-55579:
------------------------------------
Summary: Create Arrow-specific error classes for SCALAR_ITER_ARROW_UDF
Key: SPARK-55579
URL: https://issues.apache.org/jira/browse/SPARK-55579
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
h2. Background
Currently, {{SQL_SCALAR_ARROW_ITER_UDF}} reuses Pandas-specific error classes
(e.g., {{PANDAS_UDF_OUTPUT_EXCEEDS_INPUT_ROWS}},
{{RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_PANDAS_UDF}},
{{STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_PANDAS_UDF}}).
Because this eval type is pure Arrow, it should raise Arrow-specific error
classes, for clarity and for consistency with its name.
h2. Proposal
Create three new error classes in both Python and Scala:
1. {{ARROW_UDF_OUTPUT_EXCEEDS_INPUT_ROWS}} - Fail-fast check, raised as soon
as the output has more rows than the input
2. {{RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF}} - Final check, raised
when the total output row count differs from the input row count
3. {{STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_ARROW_UDF}} - Raised when the
UDF returns without fully consuming the input iterator
Update the {{SQL_SCALAR_ARROW_ITER_UDF}} implementation in
{{python/pyspark/worker.py}} to use these new error classes.
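
A minimal sketch of where the three checks fit is given below. It is not the code in {{python/pyspark/worker.py}}: plain Python lists stand in for Arrow batches, and {{SketchUDFError}} and {{verify_scalar_iter}} are hypothetical names used only for illustration. Only the three error class names come from this proposal.

```python
from typing import Callable, Iterator, List

Batch = List[int]  # stand-in for an Arrow batch; only its length matters here


class SketchUDFError(RuntimeError):
    """Hypothetical stand-in for an error that carries an error class name."""

    def __init__(self, error_class: str, detail: str = "") -> None:
        super().__init__(f"[{error_class}] {detail}".strip())
        self.error_class = error_class


def verify_scalar_iter(
    udf: Callable[[Iterator[Batch]], Iterator[Batch]],
    batches: List[Batch],
) -> List[Batch]:
    """Run an iterator UDF over the batches and apply the three checks."""
    rows_in = 0
    rows_out = 0

    def counted_input() -> Iterator[Batch]:
        # Count input rows as the UDF pulls each batch.
        nonlocal rows_in
        for batch in batches:
            rows_in += len(batch)
            yield batch

    source = counted_input()
    results: List[Batch] = []
    for out in udf(source):
        rows_out += len(out)
        # 1. Fail fast: output may never run ahead of the input consumed so far.
        if rows_out > rows_in:
            raise SketchUDFError("ARROW_UDF_OUTPUT_EXCEEDS_INPUT_ROWS")
        results.append(out)

    # 3. The UDF must have consumed the whole input iterator.
    if next(source, None) is not None:
        raise SketchUDFError("STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_ARROW_UDF")

    # 2. Total output rows must equal total input rows.
    if rows_out != rows_in:
        raise SketchUDFError(
            "RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF",
            f"output rows: {rows_out}, input rows: {rows_in}",
        )
    return results


def first_batch_only(it: Iterator[Batch]) -> Iterator[Batch]:
    # A misbehaving UDF that stops after one batch.
    yield next(it)


try:
    verify_scalar_iter(first_batch_only, [[1], [2]])
except SketchUDFError as e:
    print(e.error_class)
```

In this sketch the iterator-consumption check runs before the final row-count check, so a UDF that stops early is reported as not exhausting its input rather than as a length mismatch.
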
h2. Files to modify
- {{python/pyspark/errors/error-conditions.json}} - Add new error class
definitions
- {{common/utils/src/main/resources/error/error-conditions.json}} - Add
corresponding Scala definitions
- {{python/pyspark/worker.py}} - Update the {{error_class}} parameters in the
{{verify_*}} function calls (lines ~3045, ~3052, ~3061)
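
Since the same three names must be defined in two JSON files, a small consistency check may help during review. The sketch below assumes that each file is a JSON object keyed by error class name at the top level. The helper {{missing_error_classes}} is hypothetical, and the inline JSON strings are placeholders, not the real file contents or message wording.

```python
import json
from typing import List

NEW_ERROR_CLASSES = (
    "ARROW_UDF_OUTPUT_EXCEEDS_INPUT_ROWS",
    "RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF",
    "STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_ARROW_UDF",
)


def missing_error_classes(json_text: str) -> List[str]:
    """Return the new error class names absent from the top-level JSON keys."""
    defined = json.loads(json_text)
    return [name for name in NEW_ERROR_CLASSES if name not in defined]


# Placeholder contents standing in for the two error-conditions.json files.
python_side = json.dumps(
    {name: {"message": ["<placeholder>"]} for name in NEW_ERROR_CLASSES}
)
scala_side = json.dumps({NEW_ERROR_CLASSES[0]: {"message": ["<placeholder>"]}})

print(missing_error_classes(python_side))  # nothing missing
print(missing_error_classes(scala_side))   # two names still to add
```

To check a real checkout, the placeholder strings would be replaced with the text read from the two files listed above.
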