HyukjinKwon opened a new pull request #32569:
URL: https://github.com/apache/spark/pull/32569


   ### What changes were proposed in this pull request?
   
   https://github.com/apache/spark/pull/30309 added a configuration (disabled by default) that simplifies the error messages from Python UDFs by removing the internal stack trace from the Python workers. This PR proposes to enable that configuration by default:
   
   ```
   from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
   ```
   
   **Before**
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../python/pyspark/sql/dataframe.py", line 427, in show
       print(self._jdf.showString(n, 20, vertical))
     File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1305, in __call__
     File "/.../python/pyspark/sql/utils.py", line 127, in deco
       raise_from(converted)
     File "<string>", line 3, in raise_from
   pyspark.sql.utils.PythonException:
     An exception was thrown from Python worker in the executor:
   Traceback (most recent call last):
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
       process()
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
       serializer.dump_stream(out_iter, outfile)
     File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in 
dump_stream
       self.serializer.dump_stream(self._batched(iterator), stream)
     File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in 
dump_stream
       for obj in iterator:
     File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in 
_batched
       for item in iterator:
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
       result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in 
udfs)
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in 
<genexpr>
       result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in 
udfs)
     File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
       return lambda *a: f(*a)
     File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
       return f(*args, **kwargs)
     File "<stdin>", line 1, in <lambda>
   ZeroDivisionError: division by zero
   ```
   
   **After**
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../python/pyspark/sql/dataframe.py", line 427, in show
       print(self._jdf.showString(n, 20, vertical))
     File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1305, in __call__
     File "/.../python/pyspark/sql/utils.py", line 127, in deco
       raise_from(converted)
     File "<string>", line 3, in raise_from
   pyspark.sql.utils.PythonException:
     An exception was thrown from Python worker in the executor:
   Traceback (most recent call last):
     File "<stdin>", line 1, in <lambda>
   ZeroDivisionError: division by zero
   ```
   
   Note that the removed frames (ending in `return f(*args, **kwargs)`) are almost always the same; I would say in more than 99% of cases. For the remaining 1%, we can guide developers to disable this configuration to restore the full traceback for further debugging (see the sketch below).
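   For reference, here is a minimal sketch of toggling this behavior to reproduce the Before/After outputs above. It assumes the configuration key added in https://github.com/apache/spark/pull/30309 is `spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled`, and that `spark` is an active `SparkSession` as in the PySpark shell:
   
   ```
   from pyspark.sql.functions import udf
   
   def run_failing_udf():
       # Triggers a ZeroDivisionError inside the Python worker.
       spark.range(10).select(udf(lambda x: x / 0)("id")).collect()
   
   # Assumed configuration key (added in apache/spark#30309): when disabled,
   # the full internal worker traceback is shown ("Before"); when enabled,
   # only the user-code frames remain ("After").
   spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", "false")
   try:
       run_failing_udf()  # "Before": full traceback
   except Exception as e:
       print(e)
   
   spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", "true")
   try:
       run_failing_udf()  # "After": simplified traceback
   except Exception as e:
       print(e)
   ```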
   
   At Databricks, this configuration has been enabled for around six months, and I have received zero negative feedback on it.
   
   ### Why are the changes needed?
   
   To show simplified exception messages to end users.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it will hide the internal Python worker traceback.
   
   ### How was this patch tested?
   
   Existing test cases should cover this change.

