itholic opened a new pull request, #39137: URL: https://github.com/apache/spark/pull/39137
### What changes were proposed in this pull request?

This PR proposes to introduce `pyspark.errors` and error classes to unify and improve the errors generated by PySpark under a single path. This PR includes the following changes:

- Add error classes for PySpark, and their sub error classes, to `error-classes.json`.
- Add `PySparkErrors` on the JVM side to leverage the existing error framework.
- Add a new module: `pyspark.errors`.
- Add new errors, defined in `pyspark.errors.errors`, that return a `PySparkException` by leveraging the new error classes.
- Migrate the error messages in `pyspark/sql/functions.py` to error classes.
- Add tests for the migrated error messages.
- Add a test util, `check_error`, for testing errors by their error classes.

This is an initial PR that introduces an error framework for PySpark, to facilitate error management and to provide better, more consistent error messages to users. While active work is being done on the [SQL side to improve error messages](https://issues.apache.org/jira/browse/SPARK-37935), so far there has been no corresponding effort for PySpark. This PR is intended to kick off that effort on the PySpark side.

**Next up** after this PR:

- Migrate more Python built-in exceptions raised on the driver side to PySpark-specific errors.
- Migrate the errors generated by `Py4J` to PySpark-specific errors.
- Migrate the errors generated on the Python worker side to PySpark-specific errors.
- Migrate more error tests to tests using `checkError`.
- Currently, all PySpark-specific errors are defined as the `PySparkException` class. As the number of PySpark-specific errors grows, it may become necessary to split `PySparkException` into multiple categories.
- Add documentation.

More items will be added to the [umbrella JIRA](https://issues.apache.org/jira/browse/SPARK-41597) once this initial PR is approved.

### Why are the changes needed?

Centralizing error messages and introducing identified error classes provides the following benefits:

- Errors are searchable via their unique class names and are properly classified.
- The cost of future maintenance of PySpark errors is reduced.
- Users get consistent and actionable error messages.
- Error messages become easier to translate into different languages.

### Does this PR introduce _any_ user-facing change?

Yes, but only for error messages. There are no API changes.

For example,

**Before**

```python
>>> from pyspark.sql import functions as F
>>> F.window("date", 5)
Traceback (most recent call last):
...
TypeError: windowDuration should be provided as a string
```

**After**

```python
Traceback (most recent call last):
...
pyspark.errors.exceptions.PySparkException: [PYSPARK.NOT_A_STRING] Argument 'windowDuration' should be a string, got 'int'.
```

### How was this patch tested?

By adding unit tests and by manually running the static analysis via `dev/lint-python`.
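To make the changes listed above a bit more concrete, here is a minimal sketch of how an error-class entry and an error-raising helper could fit together. The JSON layout, the helper name `not_a_string_error`, and the `PySparkException` constructor shown here are illustrative assumptions, not the actual implementation in this PR; only the error class name `PYSPARK.NOT_A_STRING` and the resulting message come from the example output above.

```python
# Illustrative sketch only: the real error-classes.json layout, the helper name
# `not_a_string_error`, and the PySparkException constructor in this PR may differ.

# A hypothetical error-classes.json entry, shown here as a Python dict:
ERROR_CLASSES = {
    "PYSPARK.NOT_A_STRING": {
        "message": ["Argument '<arg_name>' should be a string, got '<arg_type>'."],
    },
}


class PySparkException(Exception):
    """Sketch of an error-class-aware exception (assumed constructor)."""

    def __init__(self, error_class: str, message_parameters: dict) -> None:
        template = " ".join(ERROR_CLASSES[error_class]["message"])
        for name, value in message_parameters.items():
            template = template.replace(f"<{name}>", value)
        super().__init__(f"[{error_class}] {template}")
        self.error_class = error_class
        self.message_parameters = message_parameters


def not_a_string_error(arg_name: str, arg_type: str) -> PySparkException:
    """Hypothetical helper in the spirit of pyspark.errors.errors."""
    return PySparkException(
        error_class="PYSPARK.NOT_A_STRING",
        message_parameters={"arg_name": arg_name, "arg_type": arg_type},
    )


# For example, inside functions.window() when windowDuration is not a str:
# raise not_a_string_error("windowDuration", type(windowDuration).__name__)
```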

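And a rough sketch of what an error test written against error classes, in the style of the new `check_error` utility, might look like. The `check_error` signature, the `getErrorClass()`/`getMessageParameters()` accessors, and the message parameter names used here are assumptions rather than the exact API introduced by this PR.

```python
# Sketch of an error test in the style of the new `check_error` utility.
# The check_error signature, the getErrorClass()/getMessageParameters() accessors,
# and the message parameter names below are assumptions, not the exact API
# added by this PR.
import unittest

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.errors.exceptions import PySparkException


class WindowErrorTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def check_error(self, exception, error_class, message_parameters):
        # Compare the error class and its parameters instead of asserting on the
        # rendered message text, so tests keep passing when messages are reworded.
        self.assertEqual(exception.getErrorClass(), error_class)
        self.assertEqual(exception.getMessageParameters(), message_parameters)

    def test_window_duration_must_be_string(self):
        with self.assertRaises(PySparkException) as ctx:
            F.window("date", 5)  # windowDuration passed as an int instead of a str
        self.check_error(
            exception=ctx.exception,
            error_class="PYSPARK.NOT_A_STRING",
            message_parameters={"arg_name": "windowDuration", "arg_type": "int"},
        )


if __name__ == "__main__":
    unittest.main()
```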