itholic opened a new pull request, #45377: URL: https://github.com/apache/spark/pull/45377
### What changes were proposed in this pull request?

This PR introduces an enhancement to the error messages generated by PySpark's DataFrame API, adding detailed context about the location within the user's PySpark code where the error occurred. This follows a similar improvement made on the JVM side for the Dataset API in https://github.com/apache/spark/pull/43334, aiming to provide PySpark users with the same level of detailed error context for better usability and debugging efficiency.

### Why are the changes needed?

To improve debuggability. Errors originating from PySpark operations can be difficult to debug given the limited context in current error messages. While the JVM side has been improved to offer detailed error contexts, PySpark errors often lack this level of detail.

### Does this PR introduce _any_ user-facing change?

No API changes, but error messages will include a reference to the exact line of user code that triggered the error, in addition to the existing descriptive error message.

For example, consider the following PySpark code snippet that triggers a `DIVIDE_BY_ZERO` error:

```python
1 from pyspark.sql import SparkSession
2 from pyspark.sql.functions import col
3
4 spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
5 spark.conf.set("spark.sql.ansi.enabled", True)
6
7 df = spark.range(10)
8 df.select(col("id") / 0).show()
```

**Before:**

```
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
```

**After:**

```
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

== Error Location (PySpark Code) ==
['df.select(col("id") / 0).show()\n'] was called from
/.../spark/python/test_pyspark_error.py:8
```

### How was this patch tested?

Added UTs.

### Was this patch authored or co-authored using generative AI tooling?

No.
