Yicong Huang created SPARK-55579:
------------------------------------

             Summary: Create Arrow-specific error classes for SCALAR_ITER_ARROW_UDF
                 Key: SPARK-55579
                 URL: https://issues.apache.org/jira/browse/SPARK-55579
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


h2. Background

Currently, {{SQL_SCALAR_ARROW_ITER_UDF}} uses Pandas-specific error classes 
(e.g., {{PANDAS_UDF_OUTPUT_EXCEEDS_INPUT_ROWS}}, 
{{RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_PANDAS_UDF}}, 
{{STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_PANDAS_UDF}}).

Since this is a pure Arrow UDF eval type, it should use Arrow-specific error 
classes for clarity and consistency.

h2. Proposal

Create three new error classes in both Python and Scala:

1. {{ARROW_UDF_OUTPUT_EXCEEDS_INPUT_ROWS}} - Raised by the fail-fast check 
when the output has more rows than the input
2. {{RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF}} - Raised when the final 
output row count does not match the input row count
3. {{STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_ARROW_UDF}} - Raised when the 
check that the UDF consumed the whole input iterator fails

Update the {{SQL_SCALAR_ARROW_ITER_UDF}} implementation in 
{{python/pyspark/worker.py}} to use these new error classes.
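
For illustration only, the three checks named above can be sketched in plain Python. This is not the code in {{worker.py}}: {{UDFCheckError}} and {{run_scalar_iter_udf}} are hypothetical names, plain lists stand in for Arrow batches, and the message texts are placeholders. Only the three error class names come from this proposal.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


class UDFCheckError(RuntimeError):
    """Hypothetical stand-in for the real PySpark error type."""

    def __init__(self, error_class: str, message: str) -> None:
        super().__init__(f"[{error_class}] {message}")
        self.error_class = error_class


_DONE = object()  # sentinel marking an exhausted input iterator


def run_scalar_iter_udf(
    udf: Callable[[Iterator[Sequence]], Iterable[Sequence]],
    batches: List[Sequence],
) -> List[Sequence]:
    """Run an iterator UDF over batches and enforce the row-count checks."""
    rows_in = 0
    rows_out = 0
    source = iter(batches)

    def counted_input() -> Iterator[Sequence]:
        # Count input rows as the UDF pulls each batch.
        nonlocal rows_in
        for batch in source:
            rows_in += len(batch)
            yield batch

    results = []
    for out in udf(counted_input()):
        rows_out += len(out)
        # Check 1 (fail fast): output must not run ahead of consumed input.
        if rows_out > rows_in:
            raise UDFCheckError(
                "ARROW_UDF_OUTPUT_EXCEEDS_INPUT_ROWS",
                f"output rows {rows_out} > input rows {rows_in}",
            )
        results.append(out)

    # Check 3: the UDF must have consumed the whole input iterator.
    if next(source, _DONE) is not _DONE:
        raise UDFCheckError(
            "STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_ARROW_UDF",
            "the UDF stopped before exhausting its input",
        )

    # Check 2: the final row counts must match.
    if rows_out != rows_in:
        raise UDFCheckError(
            "RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF",
            f"output rows {rows_out} != input rows {rows_in}",
        )
    return results
```

Passing the error class name as data, as the sketch does, is what would let the Arrow eval type reuse the same verification logic as the Pandas one while reporting Arrow-specific names.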

h2. Files to modify

- {{python/pyspark/errors/error-conditions.json}} - Add new error class 
definitions
- {{common/utils/src/main/resources/error/error-conditions.json}} - Add 
corresponding Scala definitions
- {{python/pyspark/worker.py}} - Update the {{error_class}} arguments passed to 
the {{verify_*}} function calls (around lines 3045, 3052, and 3061)
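
Because the same names must be defined in two JSON files, a small consistency check may help during review. The sketch below assumes each file is a JSON object keyed by error class name at the top level; {{missing_error_classes}} is a hypothetical helper, not an existing Spark utility.

```python
import json
from typing import Iterable, List

# The three error class names proposed in this issue.
NEW_ERROR_CLASSES = (
    "ARROW_UDF_OUTPUT_EXCEEDS_INPUT_ROWS",
    "RESULT_LENGTH_MISMATCH_FOR_SCALAR_ITER_ARROW_UDF",
    "STOP_ITERATION_OCCURRED_FROM_SCALAR_ITER_ARROW_UDF",
)


def missing_error_classes(
    conditions_json: str, required: Iterable[str] = NEW_ERROR_CLASSES
) -> List[str]:
    """Return the required names absent from an error-conditions JSON text.

    Assumes the top-level JSON object is keyed by error class name.
    """
    defined = json.loads(conditions_json)
    return [name for name in required if name not in defined]
```

Running it on the contents of both files listed above should return an empty list for each once the definitions are added.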


