HyukjinKwon opened a new pull request, #43778:
URL: https://github.com/apache/spark/pull/43778
### What changes were proposed in this pull request?
This PR improves the Python UDF error messages to be more actionable.
### Why are the changes needed?
Suppose you face a segfault error:
```python
from pyspark.sql.functions import udf
import ctypes
spark.range(1).select(udf(lambda x: ctypes.string_at(0))("id")).collect()
```
The current error message is not actionable:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling
o82.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
in stage 1.0 failed 1 times, most recent failure: Lost task 15.0 in stage 1.0
(TID 31) (192.168.123.102 executor driver): org.apache.spark.SparkException:
Python worker exited unexpectedly (crashed)
```
After this PR, the error message becomes:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling
o59.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
in stage 0.0 failed 1 times, most recent failure: Lost task 15.0 in stage 0.0
(TID 15) (192.168.123.102 executor driver): org.apache.spark.SparkException:
Python worker exited unexpectedly (crashed). Consider setting
'spark.sql.execution.pyspark.udf.faulthandler.enabled'
or 'spark.python.worker.faulthandler.enabled' configuration to 'true' for the
better Python traceback.
```
So you can enable the configuration and try it out:
```python
from pyspark.sql.functions import udf
import ctypes
spark.conf.set("spark.sql.execution.pyspark.udf.faulthandler.enabled",
"true")
spark.range(1).select(udf(lambda x: ctypes.string_at(0))("id")).collect()
```
which now shows where the segfault happens:
```
Caused by: org.apache.spark.SparkException: Python worker exited
unexpectedly (crashed): Fatal Python error: Segmentation fault
Current thread 0x00007ff84ae4b700 (most recent call first):
File "/.../envs/python3.9/lib/python3.9/ctypes/__init__.py", line 525 in
string_at
File "<stdin>", line 1 in <lambda>
File "/.../lib/pyspark.zip/pyspark/util.py", line 88 in wrapper
File "/.../lib/pyspark.zip/pyspark/worker.py", line 99 in <lambda>
File "/.../lib/pyspark.zip/pyspark/worker.py", line 1403 in <genexpr>
File "/.../lib/pyspark.zip/pyspark/worker.py", line 1403 in mapper
```
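For context, the improved traceback above presumably builds on Python's standard `faulthandler` module (as the configuration names suggest). A minimal standalone sketch, independent of Spark, showing the same effect: a child process enables `faulthandler`, triggers a segfault, and the fatal-error traceback appears on its stderr.

```python
import subprocess
import sys

# Child script: enable faulthandler, then dereference NULL to segfault.
# This mirrors the udf(lambda x: ctypes.string_at(0)) example above.
child_code = """
import ctypes
import faulthandler

faulthandler.enable()   # install handlers for SIGSEGV, SIGFPE, etc.
ctypes.string_at(0)     # NULL dereference -> segmentation fault
"""

proc = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
)

# With faulthandler enabled, stderr carries a "Fatal Python error"
# traceback pointing at the crashing frame instead of a silent exit.
print(proc.stderr)
```

Without the `faulthandler.enable()` call, the child simply dies with a signal and no Python-level traceback, which matches the unactionable "Python worker exited unexpectedly (crashed)" behavior before this PR.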
### Does this PR introduce _any_ user-facing change?
Yes, it makes the error message more actionable.
### How was this patch tested?
Manually tested as above.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]