dongjoon-hyun commented on PR #43261: URL: https://github.com/apache/spark/pull/43261#issuecomment-1876307415
As you can see in this PR's description, listing 51,842 files took 11 seconds in my experiment.

> For example, the improvement on a 3-year-data table with year/month/day/hour hierarchy is 158x.

Given your description, I guess `100s` (the 10x result) corresponds to roughly `520k` files. In other words, did you try to list 0.5M or 1M files under a single S3 prefix? Could you double-check with the internal benchmark team, please?

In addition, does your benchmark environment use a different S3 page size or an S3-compatible layer such as DBFS? A layered file system could introduce side effects.

Please shed some light on this for us. 😄 I'd like to take proper action on this topic because it would be really helpful for most tables and users.
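For reference, the back-of-the-envelope extrapolation behind the `520k` estimate can be sketched as follows. This is a hypothetical helper, not code from the PR; it assumes listing time scales roughly linearly with file count, which is a simplification of real S3 LIST behavior (pagination, request parallelism, etc.):

```scala
// Hedged sketch: estimate a file count from an observed listing time,
// ASSUMING listing time grows linearly with the number of files.
// Baseline numbers come from the experiment above: 51,842 files in 11 s.
object ListingEstimate {
  def estimatedFiles(observedSeconds: Double,
                     baselineFiles: Long = 51842L,
                     baselineSeconds: Double = 11.0): Long =
    math.round(observedSeconds * baselineFiles / baselineSeconds)

  def main(args: Array[String]): Unit = {
    // A 100 s listing time would correspond to roughly 470k files
    // under this linear assumption, i.e. on the order of 0.5M.
    println(estimatedFiles(100.0))
  }
}
```

Under that linear assumption, a 100-second listing lands around 470k files, which is why the reported number suggests roughly half a million objects under one prefix.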
