[GitHub] [spark] ankurdave opened a new pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

GitBox Fri, 22 Oct 2021 07:08:08 -0700


ankurdave opened a new pull request #34369:
URL: https://github.com/apache/spark/pull/34369



   ### What changes were proposed in this pull request?
   
   The previous PR https://github.com/apache/spark/pull/34245 assumed that task 
completion listeners are registered bottom-up. 
`ParquetFileFormat#buildReaderWithPartitionValues()` violates this assumption 
by lazily registering a task completion listener that closes its output 
iterator. Since task completion listeners are executed in reverse order of 
registration, this listener is registered late and therefore always runs 
before other listeners. When the downstream operator contains a Python UDF 
and the off-heap vectorized reader is enabled, this results in a 
use-after-free that causes a segfault.
   
   The fix is to close the output iterator using FileScanRDD's task completion 
listener.
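
The ordering problem can be illustrated with a small toy model. This is only a sketch and not Spark code: `ToyTaskContext`, `run_task`, and the "consumer" that may still read the reader's buffer are hypothetical names invented for illustration. The only behavior taken from the description above is that completion listeners run in reverse order of registration.

```python
class ToyTaskContext:
    """Hypothetical stand-in for a task context (not Spark's TaskContext API)."""

    def __init__(self):
        self._listeners = []

    def add_completion_listener(self, fn):
        self._listeners.append(fn)

    def mark_completed(self):
        # Listeners run in reverse order of registration (last in, first out).
        for fn in reversed(self._listeners):
            fn()


def run_task(lazy_registration):
    """Return the order of completion events for one simulated task."""
    ctx = ToyTaskContext()
    state = {"consumer_active": True}
    events = []

    def close_reader():
        # Assumption: closing the reader frees a buffer that a downstream
        # consumer may still be reading until the consumer itself is stopped.
        if state["consumer_active"]:
            events.append("buffer freed while consumer active")
        else:
            events.append("buffer freed after consumer stopped")

    def stop_consumer():
        state["consumer_active"] = False
        events.append("consumer stopped")

    if lazy_registration:
        # The reader registers its listener late, after the downstream operator.
        ctx.add_completion_listener(stop_consumer)
        ctx.add_completion_listener(close_reader)
    else:
        # Bottom-up registration: the scan registers before the downstream operator.
        ctx.add_completion_listener(close_reader)
        ctx.add_completion_listener(stop_consumer)

    ctx.mark_completed()
    return events


if __name__ == "__main__":
    print(run_task(lazy_registration=True))
    print(run_task(lazy_registration=False))
```

With late registration the reader's listener runs first, which corresponds to the use-after-free case. With bottom-up registration the downstream listener runs first and the buffer is released only afterwards.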
   
   ### Why are the changes needed?
   
   Without this PR, the Python tests introduced in 
https://github.com/apache/spark/pull/34245 are flaky ([see details in 
thread](https://github.com/apache/spark/pull/34245#issuecomment-948713545)). 
They intermittently fail with a segfault.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Repeatedly ran one of the Python tests introduced in 
https://github.com/apache/spark/pull/34245 using the commands below. 
Previously, the test was flaky and failed after about 50 runs. With this PR, 
the test has not failed after 200+ runs.
   
   ```sh
   ./build/sbt -Phive clean package && ./build/sbt test:compile
   seq 1000 | parallel -j 8 --halt now,fail=1 'echo {#}; python/run-tests --testnames pyspark.sql.tests.test_udf'
   ```

