GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/19184
[SPARK-21971][CORE] Too many open files in Spark due to concurrent files being opened

## What changes were proposed in this pull request?

In UnsafeExternalSorter::getIterator(), a file is opened in UnsafeSorterSpillReader for every spillWriter, and these files are closed only at a later point in time, as part of the close() call. When a large number of spill files are present, the number of open files grows to a great extent and ends up throwing a "Too many open files" exception. This can easily be reproduced with TPC-DS Q67 at 1 TB scale on a multi-node cluster with multiple cores per executor.

There are ways to reduce the number of spill files generated in Q67, e.g. increasing "spark.sql.windowExec.buffer.spill.threshold" (default: 4096). Another option is to increase the ulimit to a much higher value. But those are workarounds. This PR reduces the number of files that are kept open at a time in UnsafeSorterSpillReader.

## How was this patch tested?

Manual testing of Q67 at 1 TB and 10 TB scale on a multi-node cluster.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rajeshbalamohan/spark SPARK-21971

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19184.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19184

----

commit dcc2960d5f60add9bfd9446df59b0d0d06365947
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date:   2017-09-11T01:36:12Z

    [SPARK-21971][CORE] Too many open files in Spark due to concurrent files being opened

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
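[Editor's illustration] The approach described in the PR, opening each spill file lazily rather than eagerly for every spill writer, can be sketched as follows. This is a minimal standalone sketch, not the actual Spark code: the class name LazySpillReader and its API are hypothetical, and the real UnsafeSorterSpillReader does buffered record-level reads, not single-byte reads.

```java
import java.io.BufferedInputStream;
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class Demo {

    // Hypothetical reader that defers opening its file descriptor until
    // the first read, so only actively-consumed spills hold an open file.
    static class LazySpillReader implements Closeable {
        private final File spillFile;
        private InputStream in; // null until the first read

        LazySpillReader(File spillFile) {
            this.spillFile = spillFile;
        }

        int read() throws IOException {
            if (in == null) {
                // Open lazily: constructing the reader costs no descriptor.
                in = new BufferedInputStream(new FileInputStream(spillFile));
            }
            return in.read();
        }

        @Override
        public void close() throws IOException {
            if (in != null) {
                in.close();
                in = null; // release the descriptor as soon as we are done
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("spill", ".bin");
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(42); // one byte of fake spill data
        }
        // No file descriptor is held here, even with many such readers.
        LazySpillReader reader = new LazySpillReader(f);
        System.out.println(reader.read()); // first read opens the file
        reader.close();
        f.delete();
    }
}
```

With many spill files, constructing all readers up front this way holds no descriptors; each file is opened only while its data is actually being consumed, which bounds the concurrent open-file count.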