ankurdave opened a new pull request #34369: URL: https://github.com/apache/spark/pull/34369
### What changes were proposed in this pull request?

The previous PR https://github.com/apache/spark/pull/34245 assumed task completion listeners are registered bottom-up. `ParquetFileFormat#buildReaderWithPartitionValues()` violates this assumption by registering a task completion listener to close its output iterator lazily. Since task completion listeners are executed in reverse order of registration, this listener always runs before other listeners. When the downstream operator contains a Python UDF and the off-heap vectorized reader is enabled, this results in a use-after-free that causes a segfault.

The fix is to close the output iterator using FileScanRDD's task completion listener.

### Why are the changes needed?

Without this PR, the Python tests introduced in https://github.com/apache/spark/pull/34245 are flaky ([see details in thread](https://github.com/apache/spark/pull/34245#issuecomment-948713545)). They intermittently fail with a segfault.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Repeatedly ran one of the Python tests introduced in https://github.com/apache/spark/pull/34245 using the commands below. Previously, the test was flaky and failed after about 50 runs. With this PR, the test has not failed after 200+ runs.

```sh
./build/sbt -Phive clean package && ./build/sbt test:compile
seq 1000 | parallel -j 8 --halt now,fail=1 'echo {#}; python/run-tests --testnames pyspark.sql.tests.test_udf'
```
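
### Illustration of the listener ordering issue

For readers unfamiliar with the ordering problem described under "What changes were proposed", here is a minimal sketch in plain Python. It is a toy model, not Spark code: `ToyTaskContext`, `ToyBuffer`, and `run_task` are hypothetical names invented for this illustration. The "consumer" stands in, loosely, for a downstream operator that still needs the reader's buffer when its own listener runs. The only behavior taken from the description above is that listeners run in reverse order of registration.

```python
class ToyTaskContext:
    """Toy stand-in for a task context (hypothetical, not Spark's TaskContext)."""

    def __init__(self):
        self._listeners = []

    def add_completion_listener(self, fn):
        self._listeners.append(fn)

    def mark_completed(self):
        # Listeners run in reverse order of registration (last in, first out).
        for fn in reversed(self._listeners):
            fn()


class ToyBuffer:
    """Stands in for memory owned by an off-heap vectorized reader."""

    def __init__(self):
        self.freed = False

    def read(self):
        if self.freed:
            # Real off-heap memory would not raise; it could segfault instead.
            raise RuntimeError("use-after-free")
        return 42

    def free(self):
        self.freed = True


def run_task(lazy_reader_listener):
    """Return the sequence of events observed at task completion."""
    ctx = ToyTaskContext()
    buf = ToyBuffer()
    events = []

    def close_reader():
        buf.free()
        events.append("reader closed")

    def finish_consumer():
        try:
            buf.read()
            events.append("consumer read ok")
        except RuntimeError as exc:
            events.append(f"consumer failed: {exc}")

    if lazy_reader_listener:
        # Problematic order: the reader registers its listener lazily,
        # after the consumer, so it runs first and frees the buffer early.
        ctx.add_completion_listener(finish_consumer)
        ctx.add_completion_listener(close_reader)
    else:
        # Bottom-up order: the scan registers first and owns the close,
        # so the buffer is freed only after the consumer has finished.
        ctx.add_completion_listener(close_reader)
        ctx.add_completion_listener(finish_consumer)

    ctx.mark_completed()
    return events


if __name__ == "__main__":
    print(run_task(lazy_reader_listener=True))
    print(run_task(lazy_reader_listener=False))
```

In this sketch, moving the close into the listener that was registered first mirrors the idea of closing the output iterator from FileScanRDD's listener. The real fix involves details (threads, iterator lifecycles) that the toy model leaves out.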
