itholic commented on PR #45377: URL: https://github.com/apache/spark/pull/45377#issuecomment-2041313068
Hmm... I ran into a problem while trying to resolve this case. PySpark provides logs to the JVM at the time an expression is declared, but the actual execution order on the JVM side can differ from the declaration order. For example, when running the example @ueshin provided:

```python
 1 spark.conf.set("spark.sql.ansi.enabled", True)
 2 df = spark.range(10)
 3 a = df.id / 10
 4 b = df.id / 0
 5 df.select(
 6     a,
 7     df.id + 4,
 8     b,
 9     df.id * 5
10 ).show()
```

Internally, the logging is processed as below:

```python
# Logging the call site from Python to the JVM when each expression is defined, in order:
1: ("divide", "/test.py:3")
2: ("divide", "/test.py:4")
3: ("plus", "/test.py:7")
4: ("multiply", "/test.py:9")

# But the JVM may analyze the expressions in a different order than they were defined in Python:
1: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
2: ArraySeq(org.apache.spark.sql.Column.plus(Column.scala:700), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
3: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
4: ArraySeq(org.apache.spark.sql.Column.multiply(Column.scala:760), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
```

To solve this problem, I think Python and the JVM must share a "key" that uniquely identifies each expression, so that the entry in `PySparkCurrentOrigin` can be matched with the JVM stack trace captured at the time the expression is declared. However, I can't think of a good way to make this possible at the moment. @ueshin @HyukjinKwon @cloud-fan could you please advise if there happens to be a good way to do this?

Alternatively, a workaround that comes to mind is to provide additional information in the log: when the log cannot indicate the exact call site, it outputs candidates where the actual error may have occurred.
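To illustrate the "shared key" idea, here is a minimal, self-contained sketch (not actual PySpark internals; `register_call_site` and `lookup_call_site` are hypothetical names). Python assigns each expression a fresh id at declaration time and records its call site under that id; the JVM side could then resolve the correct call site by id, regardless of the order in which it analyzes the expressions:

```python
import itertools

# Hypothetical sketch: tag each expression with a unique id at declaration
# time so the call site can be looked up later, independent of analysis order.
_counter = itertools.count()
_call_site_registry = {}  # expression id -> (method name, Python call site)

def register_call_site(method: str, call_site: str) -> int:
    """Record the Python call site under a fresh id and return that id.
    The id would be attached to the expression handed to the JVM."""
    expr_id = next(_counter)
    _call_site_registry[expr_id] = (method, call_site)
    return expr_id

def lookup_call_site(expr_id: int):
    """What the JVM side would do when analyzing an expression,
    in whatever order analysis happens to visit it."""
    return _call_site_registry.get(expr_id)

# Declaration order from the example: divide, divide, plus, multiply.
ids = [
    register_call_site("divide", "/test.py:3"),
    register_call_site("divide", "/test.py:4"),
    register_call_site("plus", "/test.py:7"),
    register_call_site("multiply", "/test.py:9"),
]

# Analysis order differs (divide, plus, divide, multiply), but lookups
# by id still resolve each expression to its correct call site.
for expr_id in [ids[0], ids[2], ids[1], ids[3]]:
    print(lookup_call_site(expr_id))
```

The hard part, of course, is threading such an id through the existing Python-to-JVM call path, which is exactly what I haven't found a clean way to do.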
For example:

```python
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> df = spark.range(10)
>>> a = df.id / 10
>>> b = df.id / 0
>>>
>>> df.select(
...     a,
...     df.id + 4,
...     b,
...     df.id * 5
... ).show()
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"plus" was called from
<stdin>:3
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
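The workaround could be sketched like this (again hypothetical, not PySpark internals; `candidate_call_sites` is an illustrative name). When the JVM can only report the operation name (e.g. `"divide"`), we list every Python call site that declared an expression with that operation as a candidate, rather than claiming a single, possibly wrong, call site:

```python
# Call sites recorded on the Python side for the example above,
# keyed only by operation name (the information the JVM actually has).
call_sites = [
    ("divide", "/test.py:3"),
    ("divide", "/test.py:4"),
    ("plus", "/test.py:7"),
    ("multiply", "/test.py:9"),
]

def candidate_call_sites(method: str) -> list:
    """Return every recorded call site whose operation matches `method`.
    If more than one matches, the error message would list all of them
    as candidates instead of asserting a single call site."""
    return [site for m, site in call_sites if m == method]

print(candidate_call_sites("divide"))    # two candidates: test.py:3 and test.py:4
print(candidate_call_sites("multiply"))  # unambiguous: test.py:9
```

This doesn't pinpoint the failing expression, but at least it never points users at a definitively wrong line.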