HyukjinKwon opened a new pull request #30309:
URL: https://github.com/apache/spark/pull/30309


   ### What changes were proposed in this pull request?
   
   This PR proposes to simplify the exception messages from Python UDFs.
   
   Currently, the exception message from Python UDFs is as below:
   
   ```python
   from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
   ```
   
   ```python
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../python/pyspark/sql/dataframe.py", line 427, in show
       print(self._jdf.showString(n, 20, vertical))
     File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1305, in __call__
     File "/.../python/pyspark/sql/utils.py", line 127, in deco
       raise_from(converted)
     File "<string>", line 3, in raise_from
   pyspark.sql.utils.PythonException:
     An exception was thrown from Python worker in the executor:
   Traceback (most recent call last):
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
       process()
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
       serializer.dump_stream(out_iter, outfile)
     File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in 
dump_stream
       self.serializer.dump_stream(self._batched(iterator), stream)
     File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in 
dump_stream
       for obj in iterator:
     File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in 
_batched
       for item in iterator:
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
       result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in 
udfs)
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in 
<genexpr>
       result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in 
udfs)
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
       return lambda *a: f(*a)
     File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
       return f(*args, **kwargs)
     File "<stdin>", line 1, in <lambda>
   ZeroDivisionError: division by zero
   ```
   
   In almost all cases, users only care about `ZeroDivisionError: division by zero`. There is no need to show the internal frames in 99% of cases.
   
   This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to the Python worker, (de)serialization, etc. With it enabled, the same example produces:
   
   ```python
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../python/pyspark/sql/dataframe.py", line 427, in show
       print(self._jdf.showString(n, 20, vertical))
     File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1305, in __call__
     File "/.../python/pyspark/sql/utils.py", line 127, in deco
       raise_from(converted)
     File "<string>", line 3, in raise_from
   pyspark.sql.utils.PythonException:
     An exception was thrown from Python worker in the executor:
   Traceback (most recent call last):
     File "<stdin>", line 1, in <lambda>
   ZeroDivisionError: division by zero
   ```
   
   The traceback is shown starting from the point where the first non-PySpark file appears in it.
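
   To illustrate the idea, a minimal sketch of this kind of frame filtering with Python's `traceback` module is shown below. This is not the code added by this PR; `simplify_traceback` is a hypothetical helper that keeps only the frames from the first non-PySpark file onwards.

   ```python
   # Hypothetical sketch, not the implementation in this PR: keep only the
   # traceback frames from the first non-PySpark file onwards.
   import traceback

   def simplify_traceback(tb):
       frames = traceback.extract_tb(tb)  # list of FrameSummary objects
       for i, frame in enumerate(frames):
           if "pyspark" not in frame.filename:
               # First frame outside PySpark: keep it and everything after it.
               return frames[i:]
       return frames  # no non-PySpark frame found; keep the full traceback

   try:
       (lambda x: x / 0)(1)
   except ZeroDivisionError as e:
       print("".join(traceback.format_list(simplify_traceback(e.__traceback__))))
   ```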
   
   ### Why are the changes needed?
   
   Without this configuration, such internal tracebacks are exposed to users directly, especially for shell or notebook users in PySpark. In 99% of cases, people don't care about the internal Python worker, (de)serialization, and related tracebacks; they just make the exception more difficult to read. For example, the single `x/0` statement above produces a very long traceback, most of which is unnecessary.
   
   This configuration makes it possible to show only the simplified traceback that users are most likely interested in.
   
   ### Does this PR introduce _any_ user-facing change?
   
   By default, no. It adds one configuration that simplifies the exception 
message. See the example above.
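
   For illustration, assuming the flag behaves like a regular runtime SQL configuration, it could also be enabled within an existing session (sketch only):

   ```python
   # Sketch: opt in to simplified UDF exceptions for the current session
   # (assumes the flag can be set as a runtime SQL conf via spark.conf.set).
   spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedException.enabled", "true")

   from pyspark.sql.functions import udf

   # Re-running the failing UDF should now show only the user-facing frames.
   spark.range(10).select(udf(lambda x: x / 0)("id")).collect()
   ```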
   
   ### How was this patch tested?
   
   Manually tested:
   
   ```bash
   $ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true
   ```
   ```python
   from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
   ```
   
   Unit tests were also added.
   

