ScrapCodes commented on issue #24585: [SPARK-27664][SQL] Performance issue while listing large number of files on an object store.
URL: https://github.com/apache/spark/pull/24585#issuecomment-493425230

One simple way to reproduce it is:

# 1. Create a large listing in the object store.
```
spark.range(1, 400000, 1, 100000).write.mode("overwrite").save("cos://bucket.service/test/")
```

# 2. Fetch the listing using Spark.
```
spark.read.parquet("cos://bucket.service/test/").limit(1).show()
```

When the number of objects is around 100K+, we see eviction warnings from `FileStatusCache` ([link](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala#L126)) suggesting we increase the cache size via `spark.sql.hive.filesourcePartitionFileCacheSize`. But if we configure `concurrencyLevel` to 1, as described in this PR, the cache works and is able to fit the entire content in memory, _without increasing the size of the cache_. The reason for this is already explained in the JIRA and in reference [1] of the PR description.
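For context on why `concurrencyLevel` matters: `FileStatusCache` is backed by a Guava cache, and Guava splits `maximumWeight` across its internal segments (roughly one per concurrency level), so a single segment can start evicting entries long before the cache as a whole reaches the configured size. Below is a minimal standalone Scala sketch of that behavior, not Spark's actual `FileStatusCache` code; it only assumes Guava on the classpath, and the weight budget, entry sizes, and key names are made-up values for illustration.

```scala
import com.google.common.cache.{Cache, CacheBuilder, Weigher}

object ConcurrencyLevelSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical weight budget, standing in for spark.sql.hive.filesourcePartitionFileCacheSize.
    val maxWeightBytes = 4L * 1024 * 1024

    // Weigh each entry by its payload size, similar in spirit to how
    // FileStatusCache weighs cached file listings.
    val weigher = new Weigher[String, Array[Byte]] {
      override def weigh(key: String, value: Array[Byte]): Int = value.length
    }

    def buildCache(concurrency: Int): Cache[String, Array[Byte]] =
      CacheBuilder.newBuilder()
        .concurrencyLevel(concurrency)   // 1 => a single segment gets the whole weight budget
        .maximumWeight(maxWeightBytes)
        .weigher(weigher)
        .build[String, Array[Byte]]()

    // Total inserted weight (8 x 400KB = 3.2MB) stays under maximumWeight, so with
    // concurrencyLevel = 1 everything fits. With a higher level the budget is split
    // per segment, and a segment that receives more than its share of keys may evict
    // entries even though the cache as a whole is under the limit.
    for (level <- Seq(1, 4)) {
      val cache = buildCache(level)
      (1 to 8).foreach(i => cache.put(s"path-$i", new Array[Byte](400 * 1024)))
      println(s"concurrencyLevel=$level -> entries retained: ${cache.size()}")
    }
  }
}
```

This mirrors the effect described above: lowering `concurrencyLevel` to 1 lets the same weight budget hold the full listing, whereas higher levels can trigger size-based eviction early without the total configured size ever being reached.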
