ueshin opened a new pull request #30177:
URL: https://github.com/apache/spark/pull/30177


   ### What changes were proposed in this pull request?
   
   Because the Python evaluation consumes the parent iterator in a separate thread, it can keep consuming data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
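   
   The actual `ContextAwareIterator` is a Scala wrapper in Spark; as a rough, hypothetical Python sketch of the same idea (all names here are illustrative, not Spark's API): check task completion *before* delegating to the parent, so a closed parent is never touched.
   
   ```py
   class ContextAwareIterator:
       """Sketch of the idea: stop consuming once the task has completed,
       instead of deferring to the (possibly closed) parent iterator."""
   
       def __init__(self, parent, task_completed):
           self._parent = parent                  # the parent iterator
           self._task_completed = task_completed  # callable: has the task ended?
   
       def __iter__(self):
           return self
   
       def __next__(self):
           # Check completion before touching the parent, so a closed
           # parent (e.g. a freed off-heap column vector) is never read.
           if self._task_completed():
               raise StopIteration
           return next(self._parent)
   
   
   completed = False
   it = ContextAwareIterator(iter(range(5)), lambda: completed)
   first = next(it)      # task running: delegates to the parent
   completed = True      # task ends; the parent may now be closed
   remaining = list(it)  # safely empty instead of reading a closed parent
   ```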
   
   ### Why are the changes needed?
   
   A Python/Pandas UDF placed right after the off-heap vectorized reader can crash the executor.
   
   E.g.:
   
   ```py
   # assumes an active SparkSession `spark` and a writable `path`
   from pyspark.sql.functions import udf
   from pyspark.sql.types import LongType

   spark.range(0, 100000, 1, 1).write.parquet(path)

   spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

   def f(x):
       return 0

   fUdf = udf(f, LongType())

   spark.read.parquet(path).select(fUdf('id')).head()
   ```
   
   This is because the Python evaluation consumes the parent iterator in a separate thread, and it keeps consuming data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, this can cause a segmentation fault which crashes the executor.
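   
   A small, hypothetical Python model of the race (illustrative names; in pure Python the stale read raises a clean error, whereas reading a freed off-heap column vector is a use-after-free that segfaults):
   
   ```py
   import threading

   class Source:
       """Toy stand-in for the parent iterator's closeable backing store
       (e.g. an off-heap column vector)."""

       def __init__(self):
           self.closed = False

       def read(self):
           if self.closed:
               # Clean error here; with off-heap memory, a segfault.
               raise RuntimeError("read after close")
           return 0

   src = Source()
   started = threading.Event()
   may_continue = threading.Event()
   result = {}

   def consume():
       # Models the separate thread doing the Python evaluation.
       src.read()            # fine: the task is still running
       started.set()
       may_continue.wait()   # main thread closes the source meanwhile
       try:
           src.read()        # the dangerous read after the task ended
       except RuntimeError as e:
           result["error"] = str(e)

   t = threading.Thread(target=consume)
   t.start()
   started.wait()
   src.closed = True         # the task ends and the parent is closed
   may_continue.set()
   t.join()
   ```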
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Added tests, and verified manually.
   

