bhasudha commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1920095297
Hi @bk-mz. Wanted to add to this thread. As explained earlier in the thread, query latency may not be the only metric worth measuring. If runs with Parquet native bloom filters enabled still take about the same time, the runtime could be dominated by a few factors: every file still has to be opened to load its Parquet native bloom filter, S3 throttling, etc.

One way I would test this is to take Hudi out of the picture: take the same Parquet dataset and run the query with and without Parquet native bloom filters enabled. You should see the number of output rows reduced, but the query time may not improve much, because each file still has to be opened to read its bloom filter.

The column stats in Hudi's metadata table help reduce the number of files scanned (unlike Parquet native bloom filters). With data skipping enabled, Hudi uses the column stats stored in the metadata table instead of scanning the metadata in each Parquet file, so it can plan the query using those stats and the predicates, and read fewer files when possible (see this [blog](https://www.onehouse.ai/blog/hudis-column-stats-index-and-data-skipping-feature-help-speed-up-queries-by-an-orders-of-magnitude) for more details on data skipping in Hudi). This is particularly helpful on cloud storage, where each request has a constant overhead and requests are subject to rate limiting.

You bring valid feedback that we will take and work on: better showcasing the impact of these indexes so that users can easily spot it. We will update you shortly on how we are incorporating this.
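To make the difference in request counts concrete, here is a toy model in plain Python. It is only a sketch, not how Parquet or Hudi are implemented: `DataFile`, `make_files`, `plan_with_per_file_filter`, `build_stats_index`, and `plan_with_central_stats` are hypothetical names, an exact set stands in for the bloom filter (so there are no false positives), and "opening" a file is just a counter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Toy stand-in for one Parquet file (hypothetical, for illustration only)."""
    name: str
    keys: frozenset  # values of the filtered column stored in this file


def make_files(n, size):
    """Create n files holding disjoint, contiguous key ranges of the given size."""
    return [
        DataFile(f"file_{i}", frozenset(range(i * size, (i + 1) * size)))
        for i in range(n)
    ]


def plan_with_per_file_filter(files, key):
    """The filter lives inside each file, so every file is opened to consult it."""
    opened = 0
    matching = []
    for f in files:
        opened += 1  # one storage request per file just to reach its filter
        if key in f.keys:  # exact membership; a real bloom filter may false-positive
            matching.append(f.name)
    return opened, matching


def build_stats_index(files):
    """Central min/max index, assumed to be maintained at write time."""
    return {f.name: (min(f.keys), max(f.keys)) for f in files}


def plan_with_central_stats(stats_index, key):
    """Prune files from the central index before opening any of them."""
    candidates = [
        name for name, (lo, hi) in stats_index.items() if lo <= key <= hi
    ]
    return len(candidates), candidates  # only the candidates get opened


if __name__ == "__main__":
    files = make_files(4, 100)
    print(plan_with_per_file_filter(files, 250))
    print(plan_with_central_stats(build_stats_index(files), 250))
```

Both plans end up reading data from the same single file, but the per-file filter pays one open per file to get there, while the central index opens only the candidate. Note that the toy data is laid out in disjoint ranges; min/max stats prune far less when values are spread randomly across files.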

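For anyone wanting to try both setups, the options involved might look like the following PySpark-oriented sketch. The option keys are the ones I believe Spark and Hudi document, but please verify them against the versions you run; the helper names, `path`, and `base_path` are hypothetical, and `df` / `spark` are assumed to be an existing DataFrame and SparkSession.

```python
def parquet_bloom_options(column):
    """Writer option asking Parquet to build a native bloom filter for one column."""
    return {f"parquet.bloom.filter.enabled#{column}": "true"}


# Hudi writer side: maintain the metadata table and its column stats index.
HUDI_WRITE_OPTIONS = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
}

# Hudi reader side: use the metadata table and let data skipping prune files.
HUDI_READ_OPTIONS = {
    "hoodie.metadata.enable": "true",
    "hoodie.enable.data.skipping": "true",
}


def write_plain_parquet(df, path, bloom_column=None):
    """Write the dataset without Hudi, optionally with a bloom filter on one column."""
    options = parquet_bloom_options(bloom_column) if bloom_column else {}
    df.write.options(**options).mode("overwrite").parquet(path)


def read_hudi_with_data_skipping(spark, base_path):
    """Read a Hudi table with data skipping enabled."""
    return spark.read.format("hudi").options(**HUDI_READ_OPTIONS).load(base_path)
```

Writing the same dataset twice with `write_plain_parquet`, once with and once without `bloom_column`, and comparing the scan metrics in the Spark UI is one way to run the comparison suggested above.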