itholic commented on PR #45377: URL: https://github.com/apache/spark/pull/45377#issuecomment-2041313068
Hmm... I ran into a problem while trying to resolve this case. PySpark provides logs to the JVM at the time an expression is declared, but the actual execution order on the JVM side can differ from the declaration order. For example, when running the example @ueshin provided:

```python
 1 spark.conf.set("spark.sql.ansi.enabled", True)
 2 df = spark.range(10)
 3 a = df.id / 10
 4 b = df.id / 0
 5 df.select(
 6     a,
 7     df.id + 4,
 8     b,
 9     df.id * 5
10 ).show()
```

Internally, the logging is processed as below:

```python
# Logging the call site from Python to the JVM when each expression is defined, in order:
1: ("divide", "/test.py:3")
2: ("divide", "/test.py:4")
3: ("plus", "/test.py:7")
4: ("multiply", "/test.py:9")

# But the JVM may analyze the expressions in a different order than they were defined in Python:
1: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
2: ArraySeq(org.apache.spark.sql.Column.plus(Column.scala:700), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
3: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
4: ArraySeq(org.apache.spark.sql.Column.multiply(Column.scala:760), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
```

To solve this problem, I think Python and the JVM must share a "key" that uniquely identifies each expression, so that the entry in `PySparkCurrentOrigin` can be matched with the JVM stack trace captured at the time the expression is declared. However, I can't think of a good way to make this possible at the moment. @ueshin @HyukjinKwon @cloud-fan could you please advise if there happens to be a good way to do this?

Alternatively, a workaround that comes to mind is to provide additional information in the log: when the log cannot indicate the exact call site, it outputs candidates where the actual error may have occurred.
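To illustrate the "shared key" idea, here is a minimal, self-contained sketch (not actual PySpark internals; `register_call_site` and `lookup_call_site` are hypothetical names). Python assigns each expression a fresh id at declaration time and records its call site under that id; the JVM side could then resolve the correct call site by id, regardless of the order in which it analyzes the expressions:

```python
import itertools

# Hypothetical sketch: tag each expression with a unique id at declaration
# time so the call site can be looked up later, independent of analysis order.
_counter = itertools.count()
_call_site_registry = {}  # expression id -> (method name, Python call site)

def register_call_site(method: str, call_site: str) -> int:
    """Record the Python call site under a fresh id and return that id.
    The id would be attached to the expression handed to the JVM."""
    expr_id = next(_counter)
    _call_site_registry[expr_id] = (method, call_site)
    return expr_id

def lookup_call_site(expr_id: int):
    """What the JVM side would do when analyzing an expression,
    in whatever order analysis happens to visit it."""
    return _call_site_registry.get(expr_id)

# Declaration order from the example: divide, divide, plus, multiply.
ids = [
    register_call_site("divide", "/test.py:3"),
    register_call_site("divide", "/test.py:4"),
    register_call_site("plus", "/test.py:7"),
    register_call_site("multiply", "/test.py:9"),
]

# Analysis order differs (divide, plus, divide, multiply), but lookups
# by id still resolve each expression to its correct call site.
for expr_id in [ids[0], ids[2], ids[1], ids[3]]:
    print(lookup_call_site(expr_id))
```

The hard part, of course, is threading such an id through the existing Python-to-JVM call path, which is exactly what I haven't found a clean way to do.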
For example:

```python
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> df = spark.range(10)
>>> a = df.id / 10
>>> b = df.id / 0
>>>
>>> df.select(
...     a,
...     df.id + 4,
...     b,
...     df.id * 5
... ).show()
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"plus" was called from
<stdin>:3
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
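The workaround could be sketched like this (again hypothetical, not PySpark internals; `candidate_call_sites` is an illustrative name). When the JVM can only report the operation name (e.g. `"divide"`), we list every Python call site that declared an expression with that operation as a candidate, rather than claiming a single, possibly wrong, call site:

```python
# Call sites recorded on the Python side for the example above,
# keyed only by operation name (the information the JVM actually has).
call_sites = [
    ("divide", "/test.py:3"),
    ("divide", "/test.py:4"),
    ("plus", "/test.py:7"),
    ("multiply", "/test.py:9"),
]

def candidate_call_sites(method: str) -> list:
    """Return every recorded call site whose operation matches `method`.
    If more than one matches, the error message would list all of them
    as candidates instead of asserting a single call site."""
    return [site for m, site in call_sites if m == method]

print(candidate_call_sites("divide"))    # two candidates: test.py:3 and test.py:4
print(candidate_call_sites("multiply"))  # unambiguous: test.py:9
```

This doesn't pinpoint the failing expression, but at least it never points users at a definitively wrong line.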