ScrapCodes commented on issue #24585: [SPARK-27664][SQL] Performance issue while listing large number of files on an object store.
URL: https://github.com/apache/spark/pull/24585#issuecomment-493425230

One simple way to reproduce it is:

# 1. Create a large listing in the object store.
```
spark.range(1, 400000, 1, 100000).write.mode("overwrite").save("cos://bucket.service/test/")
```

# 2. Fetch the listing using Spark.
```
spark.read.parquet("cos://bucket.service/test/").limit(1).show()
```

When the number of objects is around 100K+, we see eviction warnings from `FileStatusCache` ([link](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala#L126)) suggesting we increase the cache size via `spark.sql.hive.filesourcePartitionFileCacheSize`. But if we configure `concurrencyLevel` to 1, as described in this PR, the cache works and is able to fit the entire content in memory, _without increasing the size of the cache_. The reason for this is already explained in the JIRA and in reference [1] of the PR description.
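For context on why `concurrencyLevel` matters: `FileStatusCache` is backed by a Guava cache, and Guava splits `maximumWeight` across its internal segments (roughly one per concurrency level), so a single segment can start evicting entries long before the cache as a whole reaches the configured size. Below is a minimal standalone Scala sketch of that behavior, not Spark's actual `FileStatusCache` code; it only assumes Guava on the classpath, and the weight budget, entry sizes, and key names are made-up values for illustration.

```scala
import com.google.common.cache.{Cache, CacheBuilder, Weigher}

object ConcurrencyLevelSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical weight budget, standing in for spark.sql.hive.filesourcePartitionFileCacheSize.
    val maxWeightBytes = 4L * 1024 * 1024

    // Weigh each entry by its payload size, similar in spirit to how
    // FileStatusCache weighs cached file listings.
    val weigher = new Weigher[String, Array[Byte]] {
      override def weigh(key: String, value: Array[Byte]): Int = value.length
    }

    def buildCache(concurrency: Int): Cache[String, Array[Byte]] =
      CacheBuilder.newBuilder()
        .concurrencyLevel(concurrency)   // 1 => a single segment gets the whole weight budget
        .maximumWeight(maxWeightBytes)
        .weigher(weigher)
        .build[String, Array[Byte]]()

    // Total inserted weight (8 x 400KB = 3.2MB) stays under maximumWeight, so with
    // concurrencyLevel = 1 everything fits. With a higher level the budget is split
    // per segment, and a segment that receives more than its share of keys may evict
    // entries even though the cache as a whole is under the limit.
    for (level <- Seq(1, 4)) {
      val cache = buildCache(level)
      (1 to 8).foreach(i => cache.put(s"path-$i", new Array[Byte](400 * 1024)))
      println(s"concurrencyLevel=$level -> entries retained: ${cache.size()}")
    }
  }
}
```

This mirrors the effect described above: lowering `concurrencyLevel` to 1 lets the same weight budget hold the full listing, whereas higher levels can trigger size-based eviction early without the total configured size ever being reached.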
