itholic commented on PR #45377:
URL: https://github.com/apache/spark/pull/45377#issuecomment-2041315119
Hmm... I ran into a problem while trying to resolve this case.
PySpark logs the call site to the JVM at the time an expression is declared,
but the actual analysis order on the JVM side can differ from the declaration order.
For example, when running the example @ueshin provided:
```python
# test.py
 1  spark.conf.set("spark.sql.ansi.enabled", True)
 2  df = spark.range(10)
 3  a = df.id / 10
 4  b = df.id / 0
 5  df.select(
 6      a,
 7      df.id + 4,
 8      b,
 9      df.id * 5,
10  ).show()
```
Internally, the logging is processed as below:
```python
# Call sites logged from Python to the JVM, in the order the expressions are defined:
1: ("divide", "/test.py:3")
2: ("divide", "/test.py:4")
3: ("plus", "/test.py:7")
4: ("multiply", "/test.py:9")

# But the JVM can analyze the expressions in a different order than they were defined in Python:
1: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
2: ArraySeq(org.apache.spark.sql.Column.plus(Column.scala:700), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
3: ArraySeq(org.apache.spark.sql.Column.divide(Column.scala:790), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
4: ArraySeq(org.apache.spark.sql.Column.multiply(Column.scala:760), java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method))
```
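To make the mismatch concrete, here is a minimal sketch (plain Python, data hardcoded from the log above) of what goes wrong if the Python-side log is naively paired with the JVM-side traces by index:

```python
# Call sites logged from Python, in declaration order (taken from the log above).
python_log = [
    ("divide", "/test.py:3"),
    ("divide", "/test.py:4"),
    ("plus", "/test.py:7"),
    ("multiply", "/test.py:9"),
]

# Operation names in the order the JVM actually analyzes the expressions.
jvm_order = ["divide", "plus", "divide", "multiply"]

# Index-based pairing mislabels the entries: the JVM's "plus" gets the call
# site of the second "divide" (/test.py:4), and vice versa.
for (op, site), jvm_op in zip(python_log, jvm_order):
    print(f"JVM analyzed {jvm_op!r:>10} -> Python log says ({op!r}, {site!r})")
```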
To solve this problem,
I think Python and the JVM must share a "key" that uniquely identifies each
expression, so the entry in `PySparkCurrentOrigin` can be matched to the JVM
stack trace captured at the time the expression is declared.
However, I can't think of a good way to make this possible at the moment.
@ueshin @HyukjinKwon @cloud-fan could you advise if there happens to be a
good way to do this?
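For reference, here is a rough, purely hypothetical sketch (none of these names exist in PySpark) of what sharing such a key could look like on the Python side: each expression gets a unique key at declaration time, and the JVM would need to carry that key with the expression so the original call site can be looked up regardless of analysis order.

```python
import uuid

# Hypothetical registry mapping a per-expression key to its Python call site.
_call_site_registry: dict[str, tuple[str, str]] = {}

def _register_call_site(op_name: str, call_site: str) -> str:
    """Record (op_name, call_site) under a fresh key at declaration time."""
    key = uuid.uuid4().hex
    _call_site_registry[key] = (op_name, call_site)
    # The key would have to travel with the expression to the JVM, e.g. as part
    # of its origin information, so the JVM can report it back on failure.
    return key

# Usage: when the JVM fails while analyzing the expression tagged with `key`,
# it reports the key and we can recover the exact call site.
key = _register_call_site("divide", "/test.py:4")
print(_call_site_registry[key])  # ('divide', '/test.py:4')
```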
## Workaround
Alternatively, the workaround that comes to mind is to attach additional
information to the log: at a minimum we can compare the fragment values and
report "divide" instead of "plus", although the call site we point to may
still not be the exact one among the matching candidates.
For example, the suggested workaround would look like:
**In**
```python
# test.py
 1  spark.conf.set("spark.sql.ansi.enabled", True)
 2  df = spark.range(10)
 3  a = df.id / 10
 4  b = df.id / 0
 5  df.select(
 6      a,
 7      df.id + 4,
 8      b,
 9      df.id * 5,
10  ).show()
```
**Out**
```
pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== DataFrame ==
"divide" was called from
/test.py:3
== Other possible call sites ==
"divide" was called from
/test.py:4
```