itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2041315119

   Hmm... I ran into a problem while trying to resolve this case.
   
   PySpark sends the call site log to the JVM at the time an expression is declared,
   but the actual analysis order on the JVM side can differ from the declaration order.
   For example, when running the example @ueshin provided:
   
   ```python
     1 spark.conf.set("spark.sql.ansi.enabled", True)
     2 df = spark.range(10)
     3 a = df.id / 10
     4 b = df.id / 0
     5 df.select(
     6   a,
     7   df.id + 4,
     8   b,
     9   df.id * 5
    10 ).show()
   ```
   
   Internally, the logging proceeds as follows:
   
   ```python
   # Call sites logged from Python to the JVM, in the order the expressions are defined:
   1: ("divide", "/test.py:3")
   2: ("divide", "/test.py:4")
   3: ("plus", "/test.py:7")
   4: ("multiply", "/test.py:9")
   
   # But the JVM may analyze the expressions in a different order than they were
   # defined in Python:
   1: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   2: ArraySeq(org.apache.spark.sql.Column.plus(Column.scala:700), 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   3: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   4: ArraySeq(org.apache.spark.sql.Column.multiply(Column.scala:760), 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
   ```
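   
   To make the mismatch concrete, here is a small standalone sketch (plain Python, no Spark involved; the data is just copied from the logs above) of why naive FIFO pairing cannot recover the right call site:
   
   ```python
   from collections import deque
   
   # Call sites as logged by Python, in declaration order (copied from the log above).
   python_log = deque([
       ("divide", "/test.py:3"),
       ("divide", "/test.py:4"),
       ("plus", "/test.py:7"),
       ("multiply", "/test.py:9"),
   ])
   
   # Order in which the JVM actually analyzes the expressions (from the stack traces above).
   jvm_analysis_order = ["divide", "plus", "divide", "multiply"]
   
   # Naive FIFO pairing attributes the wrong call site to two of the four expressions.
   for jvm_fragment in jvm_analysis_order:
       logged_fragment, call_site = python_log.popleft()
       status = "OK" if jvm_fragment == logged_fragment else "MISMATCH"
       print(f"JVM analyzed {jvm_fragment!r}, paired with ({logged_fragment!r}, {call_site!r}) -> {status}")
   ```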
   
   To solve this problem,
   I think Python and the JVM must share a "key" that uniquely identifies each expression,
   so that the entry in `PySparkCurrentOrigin` can be tied to the JVM stack trace captured
   at the time the expression is declared.
   However, I can't think of a good way to make this possible at the moment.
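   
   Roughly, the association I have in mind would look like this (a minimal sketch with hypothetical names, not an existing PySpark API; the part I don't see how to do is carrying the key from Python through to the JVM expression):
   
   ```python
   import uuid
   
   # Hypothetical Python-side store: key -> (fragment, call site).
   call_site_registry = {}
   
   def log_call_site(fragment, call_site):
       key = uuid.uuid4().hex
       call_site_registry[key] = (fragment, call_site)
       return key  # the key would somehow have to travel with the expression to the JVM
   
   # Declaration time:
   key_a = log_call_site("divide", "/test.py:3")
   key_b = log_call_site("divide", "/test.py:4")
   
   # Analysis time: even if the JVM processes the expressions in a different order,
   # looking up by key recovers the correct call site.
   print(call_site_registry[key_b])  # ('divide', '/test.py:4')
   print(call_site_registry[key_a])  # ('divide', '/test.py:3')
   ```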
   
   @ueshin @HyukjinKwon @cloud-fan could you advise if there happens to be a
   good way to make this possible?
   
   
   ## Workaround
   
   Alternatively, the workaround that comes to mind is to include additional
   information in the log.
   That way we can at least compare fragment values and report "divide" instead of
   "plus", although the exact call site may still be ambiguous (the matching is
   sketched after the example below):
   
   For example, the suggested workaround would look like:
   
   **In**
   ```python
     1 spark.conf.set("spark.sql.ansi.enabled", True)
     2 df = spark.range(10)
     3 a = df.id / 10
     4 b = df.id / 0
     5 df.select(
     6   a,
     7   df.id + 4,
     8   b,
     9   df.id * 5
    10 ).show()
   ```
   **Out**
   ```
   pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] 
Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL 
instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this 
error. SQLSTATE: 22012
   == DataFrame ==
   "divide" was called from
   /test.py:3
   
   == Other possible call sites ==
   "divide" was called from
   /test.py:4
   ```
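   
   The fragment matching behind this output could look roughly like the following (a plain-Python sketch with hypothetical names, just to show how the candidate call sites would be narrowed down; the real filtering would have to happen on the JVM side when the exception is raised):
   
   ```python
   # Every (fragment, call site) pair logged from Python, kept until execution.
   python_log = [
       ("divide", "/test.py:3"),
       ("divide", "/test.py:4"),
       ("plus", "/test.py:7"),
       ("multiply", "/test.py:9"),
   ]
   
   def candidate_call_sites(failing_fragment, log):
       """Return every logged call site whose fragment matches the failing operation."""
       return [site for fragment, site in log if fragment == failing_fragment]
   
   # The JVM only knows the failing operation was a "divide", so "plus" and
   # "multiply" can be ruled out, but /test.py:3 and /test.py:4 cannot be told apart.
   print(candidate_call_sites("divide", python_log))  # ['/test.py:3', '/test.py:4']
   ```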

