vinothchandar commented on pull request #4226: URL: https://github.com/apache/hudi/pull/4226#issuecomment-989162876
Teeing off @prashantwason's points: for queries that process more data, the majority of the time is going to be spent on I/O, reading files. And Hudi provides the widest set of data plane capabilities of any open data lake storage at this time (clustering, compaction, space-filling curves, indexing) to optimize that I/O.

Adding some real numbers here for context. The story around file listing performance has been conveyed in somewhat misleading ways in the past, and we have an opportunity to educate the right way.

- File listings are pretty fast in general, even on cloud storage like S3. For example, the p50 list latencies for a single folder with 100, 1K, 10K, and 100K files/objects are about 55ms, 131ms, 1062ms, and 9932ms respectively. That scales roughly linearly. On systems like HDFS it's going to be even faster, since it's all cached in a JVM.

```
FSBenchmark.listStatus:listStatus·p0.50 s3://8l-tpcds-data/s3_bench ./s3-bench-conf.properties 1 100000 run-1621043108527 sample 9932.112 ms/op
FSBenchmark.listStatusLatency:listStatusLatency·p0.50 s3://8l-tpcds-data/s3_bench ./s3-bench-conf.properties 1 10000 run-1621027957414 sample 1062.207 ms/op
FSBenchmark.listStatus:listStatus·p0.50 s3://8l-tpcds-data/s3_bench ./s3-bench-conf.properties 1 1000 run-1621033979258 sample 131.924 ms/op
FSBenchmark.listStatus:listStatus·p0.50 s3://8l-tpcds-data/s3_bench ./s3-bench-conf.properties 1 100 run-1621033449117 sample 55.312 ms/op
```

- So the real problem is actually just the throttling of list requests when you do them hierarchically on multi-level directory prefixes. I am not sure if cloud stores other than S3 have this sort of throttling issue.

Hudi improves file listing performance in the following ways.

- Metadata table: serves listings from an internal metadata table, which can take about `100-500ms` per read, even for very large tables.
- Timeline server: caches portions of the metadata (currently only for writers) and provides ~10ms performance for listings, which is why Hudi is best suited for large-scale incremental/streaming pipelines on the lake.
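
As a rough sanity check on the "scales linearly" observation, the p50 numbers from the benchmark output can be normalized per list page. This is only a sketch: it assumes each list call returns at most 1000 keys (the S3 ListObjectsV2 page limit) and that the benchmarked client pages sequentially, which the benchmark output itself does not confirm. The names `P50_MS`, `PAGE_SIZE`, `pages`, and `ms_per_page` are made up for illustration.

```python
import math

# p50 listStatus latencies in ms, copied from the benchmark output above,
# keyed by the number of objects in the single S3 folder.
P50_MS = {100: 55.312, 1_000: 131.924, 10_000: 1062.207, 100_000: 9932.112}

# Assumption: one list request returns at most 1000 keys.
PAGE_SIZE = 1000


def pages(num_objects: int, page_size: int = PAGE_SIZE) -> int:
    """Number of sequential list requests needed to enumerate the folder."""
    return math.ceil(num_objects / page_size)


def ms_per_page(num_objects: int, latency_ms: float) -> float:
    """Average p50 latency attributed to each list request."""
    return latency_ms / pages(num_objects)


if __name__ == "__main__":
    for n, ms in P50_MS.items():
        print(f"{n:>7} objects: {pages(n):>3} page(s), "
              f"{ms_per_page(n, ms):6.1f} ms/page")
```

Under those assumptions, the 10K and 100K cases both work out to roughly 100 ms per page, which is consistent with latency growing linearly in the number of list requests rather than in some worse-than-linear way.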
