itholic opened a new pull request, #45377: URL: https://github.com/apache/spark/pull/45377
### What changes were proposed in this pull request?

This PR introduces an enhancement to the error messages generated by PySpark's DataFrame API, adding detailed context about the location within the user's PySpark code where the error occurred. This follows a similar improvement made on the JVM side for the Dataset API in https://github.com/apache/spark/pull/43334, aiming to provide PySpark users with the same level of detailed error context for better usability and debugging efficiency.

### Why are the changes needed?

To improve debuggability. Errors originating from PySpark operations can be difficult to debug given the limited context in current error messages. While the JVM side has been improved to offer detailed error contexts, PySpark errors often lack this level of detail.

### Does this PR introduce _any_ user-facing change?

No API changes, but error messages will include a reference to the exact line of user code that triggered the error, in addition to the existing descriptive error message.

For example, consider the following PySpark code snippet that triggers a `DIVIDE_BY_ZERO` error:

```python
1 from pyspark.sql import SparkSession
2 from pyspark.sql.functions import col
3
4 spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
5 spark.conf.set("spark.sql.ansi.enabled", True)
6
7 df = spark.range(10)
8 df.select(col("id") / 0).show()
```

**Before:**

```
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
```

**After:**

```
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

== Error Location (PySpark Code) ==
['df.select(col("id") / 0).show()\n'] was called from
/.../spark/python/test_pyspark_error.py:8
```

### How was this patch tested?

Added UTs.

### Was this patch authored or co-authored using generative AI tooling?

No.
