dongjoon-hyun commented on PR #43261: URL: https://github.com/apache/spark/pull/43261#issuecomment-1876307415
As you can see in this PR's description, listing 51,842 files took 11 seconds in my experiment.

> For example, the improvement on a 3-year-data table with year/month/day/hour hierarchy is 158x.

Given your description, I guess `100s` (the 10x result) corresponds to roughly `520k` files. In other words, did you try to list 0.5M or 1M files under a single S3 prefix? Could you double-check with the internal benchmark team, please?

In addition, does your benchmark environment use a different S3 page size or an S3-compatible layer such as DBFS? A layered file system could introduce side effects.

Please shed some light on this for us. 😄 I'd like to take proper action on this topic because it would be really helpful for most tables and users.
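For reference, the back-of-the-envelope extrapolation behind the `520k` estimate can be sketched as follows. This is a hypothetical helper, not code from the PR; it assumes listing time scales roughly linearly with file count, which is a simplification of real S3 LIST behavior (pagination, request parallelism, etc.):

```scala
// Hedged sketch: estimate a file count from an observed listing time,
// ASSUMING listing time grows linearly with the number of files.
// Baseline numbers come from the experiment above: 51,842 files in 11 s.
object ListingEstimate {
  def estimatedFiles(observedSeconds: Double,
                     baselineFiles: Long = 51842L,
                     baselineSeconds: Double = 11.0): Long =
    math.round(observedSeconds * baselineFiles / baselineSeconds)

  def main(args: Array[String]): Unit = {
    // A 100 s listing time would correspond to roughly 470k files
    // under this linear assumption, i.e. on the order of 0.5M.
    println(estimatedFiles(100.0))
  }
}
```

Under that linear assumption, a 100-second listing lands around 470k files, which is why the reported number suggests roughly half a million objects under one prefix.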
