ubyyj commented on issue #3636: URL: https://github.com/apache/iceberg/issues/3636#issuecomment-1097823775
I believe this happens because too many files are opened at once and closed too late. To work around the issue, you can shrink the worker thread pool when starting Spark SQL:

`--conf "spark.executor.extraJavaOptions=-Diceberg.worker.num-threads=1" --conf "spark.driver.extraJavaOptions=-Diceberg.worker.num-threads=1"`

The sequence, in class `ManifestFilterManager` (see the sketch below this list):

1. `filterManifests()` invokes `filterManifest()` in parallel via `ThreadPools.getWorkerPool()`.
2. `filterManifest()` creates a `ManifestReader` in a try-with-resources block, then invokes `filterManifestWithDeletedFiles()`.
3. `filterManifestWithDeletedFiles()` creates another `CloseableIterable` via `reader.entries()`, which opens the underlying `.avro` file; that stream is not closed until the try block from step 2 finishes.

We could also trigger `close()` inside `filterManifestWithDeletedFiles()` to mitigate the issue; see the second sketch below.
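Here is a minimal, self-contained sketch of that call path. The method names mirror `ManifestFilterManager`, but the types (`ManifestReaderLike`, `CloseableEntries`) and signatures are simplified stand-ins I made up for illustration, not the actual Iceberg source:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;

// Hypothetical stand-ins for Iceberg's ManifestReader / CloseableIterable.
interface CloseableEntries extends Iterable<Object>, Closeable {}

interface ManifestReaderLike extends Closeable {
  CloseableEntries entries();
}

class ManifestFilterFlowSketch {
  private final ExecutorService workerPool;

  ManifestFilterFlowSketch(ExecutorService workerPool) {
    this.workerPool = workerPool;
  }

  // Step 1: each manifest is filtered concurrently on the shared worker pool,
  // so the number of simultaneously open .avro files scales with
  // iceberg.worker.num-threads.
  void filterManifests(List<String> manifestPaths) {
    for (String path : manifestPaths) {
      workerPool.submit(() -> filterManifest(path));
    }
  }

  // Step 2: the reader (and its open file handle) lives for this entire
  // try-with-resources block.
  private void filterManifest(String path) {
    try (ManifestReaderLike reader = openReader(path)) {
      filterManifestWithDeletedFiles(reader);
      // ... more filtering work while the .avro stream is still open ...
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  // Step 3: entries() opens a stream over the .avro file, but nothing closes
  // it here; it stays open until the try block in filterManifest() unwinds.
  private void filterManifestWithDeletedFiles(ManifestReaderLike reader) {
    for (Object entry : reader.entries()) {
      // inspect the entry, decide whether its data file was deleted ...
    }
  }

  // Stand-in for opening a ManifestReader (ManifestFiles.read(...) in Iceberg).
  private ManifestReaderLike openReader(String path) {
    throw new UnsupportedOperationException("not part of this sketch");
  }
}
```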
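And a sketch of the proposed mitigation, against the same stand-in types: close the entries stream as soon as the scan finishes instead of waiting for the outer try block in `filterManifest()` to unwind:

```java
import java.io.Closeable;
import java.io.IOException;

// Same hypothetical stand-ins as above, repeated so this sketch is self-contained.
interface CloseableEntries extends Iterable<Object>, Closeable {}

interface ManifestReaderLike extends Closeable {
  CloseableEntries entries();
}

class EagerCloseSketch {
  // Mitigation: wrap entries() in try-with-resources so the .avro stream is
  // released here, rather than when the try block in filterManifest() exits.
  void filterManifestWithDeletedFiles(ManifestReaderLike reader) throws IOException {
    try (CloseableEntries entries = reader.entries()) {
      for (Object entry : entries) {
        // inspect the entry, record deleted files, etc.
      }
    } // entries.close() runs here, freeing the file handle immediately
  }
}
```

Even with this eager close, the reader opened in step 2 stays alive for the rest of its try block, so capping `iceberg.worker.num-threads` still helps bound the total number of open handles.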
