ScrapCodes opened a new pull request #24585: Performance issue while listing 
large number of files on an object store.
URL: https://github.com/apache/spark/pull/24585
 
 
   ## What changes were proposed in this pull request?
   
   Spark uses FileStatusCache to cache file listings while scanning a 
filesystem path. When the file system is a remote object store (Amazon S3 or 
IBM COS), this cache is especially important because it avoids repeatedly 
fetching the same listing over the network.
   
   FileStatusCache is backed by a Guava cache, which is configured with a 
reasonably high default maximum size. However, when a remote listing is large 
(>100K), the size this cache needs is also very high. The underlying Guava 
cache is currently configured with the default concurrency level of 4. As a 
result, a single entry must be smaller than 
`maxSizeOfCache/concurrencyLevel` [1]. Users often keep everything under a 
single directory or path on an object store, so the entire FileStatus array 
of 100K+ entries is inserted as a single entry in the cache. The cache 
therefore has to be more than 4x the size of that entry in order to hold it. 
   
   Please refer to the JIRA issue 
[SPARK-27664](https://issues.apache.org/jira/browse/SPARK-27664) for a more 
detailed explanation. 
   
   This patch sets the default concurrency level of the underlying Guava cache 
to 1 and makes it configurable. In practice this cache holds only a few, very 
large entries, so the performance penalty should be small, if any.
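
   To make the size arithmetic concrete, here is a minimal sketch of the limit 
described above. It is a simplified model, not Guava's actual code: it assumes 
the maximum cache size is split evenly across `concurrencyLevel` segments and 
that one entry has to fit in one segment. The class name `SegmentBudget`, its 
methods, and the sizes used in `main` are hypothetical and chosen only for 
illustration.

```java
// Simplified model of the per-entry limit; NOT Guava's implementation.
// Assumption: the cache's maximum size is divided evenly across
// `concurrencyLevel` segments, and a single entry must fit in one segment.
final class SegmentBudget {
    private SegmentBudget() {}

    /** Approximate share of the cache available to one segment. */
    static long perSegmentBudget(long maxCacheSize, int concurrencyLevel) {
        if (maxCacheSize < 0 || concurrencyLevel < 1) {
            throw new IllegalArgumentException("invalid cache size or concurrency level");
        }
        return maxCacheSize / concurrencyLevel;
    }

    /** True if, under this model, one entry of the given size can be kept. */
    static boolean fits(long entrySize, long maxCacheSize, int concurrencyLevel) {
        return entrySize <= perSegmentBudget(maxCacheSize, concurrencyLevel);
    }

    /** Smallest cache size that can hold one entry of the given size. */
    static long minCacheSizeFor(long entrySize, int concurrencyLevel) {
        return Math.multiplyExact(entrySize, (long) concurrencyLevel);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long cacheSize = 256 * mib;   // hypothetical cache limit
        long listingSize = 100 * mib; // hypothetical size of one large listing

        System.out.println(fits(listingSize, cacheSize, 4));      // false
        System.out.println(fits(listingSize, cacheSize, 1));      // true
        System.out.println(minCacheSizeFor(listingSize, 4) / mib); // 400
    }
}
```

   Under these assumptions, a listing that uses well under half of the cache is 
still rejected at concurrency level 4, while the same listing fits once the 
level is 1.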
   
   I am open to working on an alternative solution as well; please feel free 
to suggest one.
   
   [1]. https://github.com/google/guava/issues/3462 
   
   ## How was this patch tested?
   
   Existing tests should pass.
   Manually verified the expected behaviour against a path with a large 
listing (~200K).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
