itholic opened a new pull request, #39128:
URL: https://github.com/apache/spark/pull/39128
### What changes were proposed in this pull request?
This PR proposes to introduce `pyspark.errors` and error classes to unify
and improve the errors generated by PySpark under a single path.
This PR includes the changes below:
- `python/pyspark/__init__.py`
  - Add new class `PySparkException`.
  - Add PySpark-specific errors that raise `PySparkException`.
- `python/pyspark/sql/functions.py`
  - Migrate Python built-in exceptions to PySpark-specific errors.
- `pyspark/errors/error_classes.py`
  - Add error classes to identify the PySpark-specific errors.
- `python/pyspark/testing/utils.py`
  - Add `checkError` to test errors with `error_class` and
    `message_parameter` instead of error message.
- `python/pyspark/sql/tests/test_functions.py`
  - Add & modify the tests by using `checkError`.
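To make the shape of these changes concrete, here is a minimal, self-contained sketch of how an error-class registry and an exception carrying an error class might fit together. It is not the code in this PR: the `ERROR_CLASSES` dictionary, the constructor signature, the parameter keys `arg_name` and `arg_type`, and the toy `window` validator are all assumptions made for illustration.

```python
# Hypothetical stand-in for an error-class registry. The real contents and
# layout of pyspark/errors/error_classes.py may differ.
ERROR_CLASSES = {
    "NOT_A_STRING": "Argument '{arg_name}' should be a string, got '{arg_type}'.",
}


class PySparkException(Exception):
    """Illustrative exception that carries an error class and its parameters."""

    def __init__(self, error_class, message_parameters):
        self.error_class = error_class
        self.message_parameters = message_parameters
        # Look up the message template and fill in the parameters.
        template = ERROR_CLASSES[error_class]
        message = template.format(**message_parameters)
        super().__init__(f"[{error_class}] {message}")


def window(time_column, window_duration):
    """Toy validator (assumption) that mimics the argument check in this PR."""
    if not isinstance(window_duration, str):
        raise PySparkException(
            error_class="NOT_A_STRING",
            message_parameters={
                "arg_name": "windowDuration",
                "arg_type": type(window_duration).__name__,
            },
        )
    return (time_column, window_duration)
```

Because the message text lives in one registry, a caller can branch on `error_class` rather than parsing the message string.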
This is an initial PR that introduces an error framework for PySpark to
facilitate error management and provide better, more consistent error messages to
users.
While active work is under way on the [SQL side to improve error
messages](https://issues.apache.org/jira/browse/SPARK-37935), so far there has
been no work to improve error messages in PySpark.
Next steps after this PR include:
- Migrate more Python built-in exceptions generated on the driver side into
PySpark-specific errors.
- Migrate the errors generated by `Py4J` into PySpark-specific errors.
- Migrate the errors generated on the Python worker side into PySpark-specific
errors.
- Migrate more error tests into tests using `checkError`.
- Currently all PySpark-specific errors are defined as the `PySparkException`
class. As the number of PySpark-specific errors increases in the future, it may
be necessary to further refine `PySparkException` into multiple categories.
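As a rough illustration of what refining `PySparkException` into categories could look like, the sketch below derives category classes that also inherit from the matching Python built-in. The subclass names `PySparkTypeError` and `PySparkValueError` and the simplified constructor are hypothetical and are not defined by this PR.

```python
class PySparkException(Exception):
    """Minimal stand-in for the base class proposed in this PR."""

    def __init__(self, error_class, message):
        self.error_class = error_class
        super().__init__(f"[{error_class}] {message}")


# Hypothetical categories. Inheriting from the built-in as well means
# existing `except TypeError` / `except ValueError` handlers keep working.
class PySparkTypeError(PySparkException, TypeError):
    """An argument has the wrong type."""


class PySparkValueError(PySparkException, ValueError):
    """An argument has an invalid value."""


try:
    raise PySparkTypeError(
        "NOT_A_STRING",
        "Argument 'windowDuration' should be a string, got 'int'.",
    )
except TypeError as exc:  # caught through the built-in base class
    caught = exc
```

One possible benefit of this design is that a migration from built-in exceptions would not break user code that already catches the built-in type.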
### Why are the changes needed?
Centralizing error messages and introducing identified error classes provides
the following benefits:
- Makes errors searchable via unique class names and properly classified.
- Reduces the cost of future maintenance for PySpark errors.
- Provides consistent and actionable error messages to users.
- Facilitates translating error messages into different languages.
### Does this PR introduce _any_ user-facing change?
Yes, but only for error messages. There are no API changes.
For example,
**Before**
```python
>>> from pyspark.sql import functions as F
>>> F.window("date", 5)
Traceback (most recent call last):
...
TypeError: windowDuration should be provided as a string
```
**After**
```python
>>> from pyspark.sql import functions as F
>>> F.window("date", 5)
Traceback (most recent call last):
...
pyspark.errors.PySparkException: [NOT_A_STRING] Argument 'windowDuration' should be a string, got 'int'.
```
### How was this patch tested?
By adding unit tests and running the existing static analysis tools (`dev/lint-python`).
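The idea of testing against the error class and message parameters, rather than the message text, can be sketched as follows. This is an assumption-laden illustration, not the helper added by this PR: the `check_error` method name and signature, the stand-in `PySparkException`, and the parameter keys `arg_name` and `arg_type` are all hypothetical.

```python
import unittest


class PySparkException(Exception):
    """Minimal stand-in so the example runs without PySpark installed."""

    def __init__(self, error_class, message_parameters):
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(f"[{error_class}] {message_parameters}")


class ErrorCheckMixin:
    """Hypothetical helper in the spirit of `checkError`."""

    def check_error(self, exception, error_class, message_parameters):
        # Compare structured fields, not the rendered message string.
        self.assertIsInstance(exception, PySparkException)
        self.assertEqual(exception.error_class, error_class)
        self.assertEqual(exception.message_parameters, message_parameters)


class WindowErrorTest(ErrorCheckMixin, unittest.TestCase):
    def test_not_a_string(self):
        params = {"arg_name": "windowDuration", "arg_type": "int"}
        with self.assertRaises(PySparkException) as ctx:
            # Stands in for a call such as F.window("date", 5).
            raise PySparkException("NOT_A_STRING", params)
        self.check_error(ctx.exception, "NOT_A_STRING", params)
```

A test written this way should keep passing when the wording of a message is edited, as long as the error class and its parameters stay the same.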