GitHub user rajeshbalamohan opened a pull request:

    https://github.com/apache/spark/pull/19184

    [SPARK-21971][CORE] Too many open files in Spark due to concurrent files being opened
    
    ## What changes were proposed in this pull request?
    
    In UnsafeExternalSorter::getIterator(), a file is opened in UnsafeSorterSpillReader for every spillWriter, and these files are closed only at a later point in time, as part of the close() call.
    However, when a large number of spill files is present, the number of open files grows to the point of throwing a "Too many open files" exception.
    This can easily be reproduced with TPC-DS Q67 at 1 TB scale on a multi-node cluster with multiple cores per executor.
    
    There are ways to reduce the number of spill files generated in Q67: e.g., increase "spark.sql.windowExec.buffer.spill.threshold" (4096 is the default), or raise the ulimit to a much higher value.
    But those are workarounds.
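    
    As a minimal sketch of that first workaround (the conf name comes from the text above; the class name and the 1048576 value are arbitrary examples, not recommendations):
    
        import org.apache.spark.sql.SparkSession;
        
        public class SpillThresholdWorkaround {
          public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("q67-workaround")
                // Default is 4096 rows; a higher threshold buffers more rows
                // in memory before spilling, so fewer spill files are created.
                .config("spark.sql.windowExec.buffer.spill.threshold", "1048576")
                .getOrCreate();
            // ... run TPC-DS Q67 here ...
            spark.stop();
          }
        }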
    
    This PR reduces the number of files that are kept open at any given time in 
UnsafeSorterSpillReader.
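    
    The following is only a sketch of the idea, not the actual patch (LazyChainedIterator and the Supplier wrapping are hypothetical): chain the per-spill-file iterators and open each file lazily, so at most one spill file is open at a time instead of one per spillWriter.
    
        import java.util.Collections;
        import java.util.Iterator;
        import java.util.NoSuchElementException;
        import java.util.Queue;
        import java.util.function.Supplier;
        
        // Each Supplier opens its spill file only when first asked for an iterator.
        final class LazyChainedIterator<T> implements Iterator<T> {
          private final Queue<Supplier<Iterator<T>>> pending; // one per spill file
          private Iterator<T> current = Collections.emptyIterator();
        
          LazyChainedIterator(Queue<Supplier<Iterator<T>>> pending) {
            this.pending = pending;
          }
        
          @Override
          public boolean hasNext() {
            // Move to the next spill file only once the current one is drained.
            while (!current.hasNext() && !pending.isEmpty()) {
              current = pending.poll().get();
            }
            return current.hasNext();
          }
        
          @Override
          public T next() {
            if (!hasNext()) {
              throw new NoSuchElementException();
            }
            return current.next();
          }
        }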
    
    
    ## How was this patch tested?
    Manual testing of Q67 at 1 TB and 10 TB scale on a multi-node cluster.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rajeshbalamohan/spark SPARK-21971

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19184
    
----
commit dcc2960d5f60add9bfd9446df59b0d0d06365947
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date:   2017-09-11T01:36:12Z

    [SPARK-21971][CORE] Too many open files in Spark due to concurrent files 
being opened

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
