itholic opened a new pull request, #39137:
URL: https://github.com/apache/spark/pull/39137

   ### What changes were proposed in this pull request?
   
   This PR proposes to introduce `pyspark.errors` and error classes to unify and improve the errors generated by PySpark under a single path.
   
   This PR includes the changes below:
   - Add an error class for PySpark and its sub-error classes to `error-classes.json`.
   - Add `PySparkErrors` on the JVM side to leverage the existing error framework.
   - Add a new module: `pyspark.errors`.
   - Add new errors, defined in `pyspark.errors.errors`, that return a `PySparkException` built on the new error classes (see the sketch after this list).
   - Migrate the error messages in `pyspark/sql/functions.py` to error classes.
   - Add tests for the migrated error messages.
   - Add a test util, `check_error`, for testing errors by their error classes.
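   
   To make the intended flow concrete, below is a minimal, self-contained sketch of how a call site could raise a PySpark-specific error through an error class. It is illustrative only: the registry dict, the `PySparkException` constructor signature, and the `not_a_string_error` helper are assumptions for this sketch, not the actual API added by this PR.
   
   ```python
   # Illustrative sketch only: the real implementation lives in `pyspark.errors`;
   # all names and signatures here are assumptions, not the PR's actual API.
   
   # Stand-in for the error-class registry kept in `error-classes.json`.
   ERROR_CLASSES = {
       "PYSPARK.NOT_A_STRING": "Argument '<arg_name>' should be a string, got '<arg_type>'.",
   }
   
   
   class PySparkException(Exception):
       """Minimal stand-in for `pyspark.errors.exceptions.PySparkException`."""
   
       def __init__(self, error_class: str, message_parameters: dict):
           # Resolve the message template from the registry and fill in the parameters.
           message = ERROR_CLASSES[error_class]
           for key, value in message_parameters.items():
               message = message.replace(f"<{key}>", value)
           self.error_class = error_class
           self.message_parameters = message_parameters
           super().__init__(f"[{error_class}] {message}")
   
   
   def not_a_string_error(arg_name: str, arg_type: str) -> PySparkException:
       """Hypothetical helper mirroring what `pyspark.errors.errors` might expose."""
       return PySparkException(
           error_class="PYSPARK.NOT_A_STRING",
           message_parameters={"arg_name": arg_name, "arg_type": arg_type},
       )
   
   
   # A call site such as `pyspark/sql/functions.py` would then raise:
   #     raise not_a_string_error("windowDuration", type(windowDuration).__name__)
   ```
   
   The point of the design is that message templates live in one place (`error-classes.json`) while call sites only pass an error class plus its parameters.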
   
   This is an initial PR introducing an error framework for PySpark to facilitate error management and provide better, more consistent error messages to users.
   
   While active work is being done on the [SQL side to improve error messages](https://issues.apache.org/jira/browse/SPARK-37935), so far there has been no corresponding effort to improve error messages in PySpark.
   
   So I'd like this PR to also kick off the error message improvement effort on the PySpark side.
   
   **Next up** items for this PR include:
   - Migrate more Python built-in exceptions raised on the driver side to PySpark-specific errors.
   - Migrate the errors generated by `Py4J` to PySpark-specific errors.
   - Migrate the errors generated on the Python worker side to PySpark-specific errors.
   - Migrate more error tests to tests using `check_error`.
   - Currently, all PySpark-specific errors are raised as the `PySparkException` class. As the number of PySpark-specific errors grows, it may be necessary to split `PySparkException` into multiple categories.
   - Add documentation.
   
   Will add more items to the [umbrella JIRA](https://issues.apache.org/jira/browse/SPARK-41597) once this initial PR gets approved.
   
   ### Why are the changes needed?
   
   Centralizing error messages and introducing identified error classes provides the following benefits:
   - Errors become searchable via their unique class names and are properly classified.
   - Reduces the cost of future maintenance for PySpark errors.
   - Provides consistent and actionable error messages to users.
   - Facilitates translating error messages into different languages.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, but only for error messages. There are no API changes at all.
   
   For example,
   
   **Before**
   ```python
   >>> from pyspark.sql import functions as F
   >>> F.window("date", 5)
   Traceback (most recent call last):
   ...
   TypeError: windowDuration should be provided as a string
   ```
   
   **After**
   ```python
   >>> F.window("date", 5)
   Traceback (most recent call last):
   ...
   pyspark.errors.exceptions.PySparkException: [PYSPARK.NOT_A_STRING] Argument 'windowDuration' should be a string, got 'int'.
   ```
   
   ### How was this patch tested?
   
   By adding unit tests and manually running the static analysis from `dev/lint-python`.
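   
   For reference, here is a hedged sketch of what a `check_error`-style assertion could look like in a unit test. The `(exception, error_class, message_parameters)` shape is a guess for illustration, and the stand-in classes below are not the real PySpark test utilities.
   
   ```python
   import unittest
   
   
   # Illustrative stand-ins only: attribute and parameter names are assumptions,
   # not the actual `check_error` util added by this PR.
   class FakeError(Exception):
       """Toy error object carrying an error class and its message parameters."""
   
       def __init__(self, error_class, message_parameters):
           super().__init__(error_class)
           self.error_class = error_class
           self.message_parameters = message_parameters
   
   
   def check_error(testcase, exception, error_class, message_parameters):
       """Assert that an exception carries the expected error class and parameters."""
       testcase.assertEqual(exception.error_class, error_class)
       testcase.assertEqual(exception.message_parameters, message_parameters)
   
   
   class NotAStringErrorTest(unittest.TestCase):
       def test_window_duration_must_be_string(self):
           error = FakeError(
               error_class="PYSPARK.NOT_A_STRING",
               message_parameters={"arg_name": "windowDuration", "arg_type": "int"},
           )
           check_error(
               self,
               exception=error,
               error_class="PYSPARK.NOT_A_STRING",
               message_parameters={"arg_name": "windowDuration", "arg_type": "int"},
           )
   
   
   if __name__ == "__main__":
       unittest.main()
   ```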

