bhasudha commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1920095297

   Hi @bk-mz. Wanted to add to this thread. As explained in the comments 
above, query latency may not be the only metric to measure. The runs with 
parquet native bloom filters enabled could still take a similar amount of time 
because they are dominated by a few factors: the need to still open every file 
to load its parquet native bloom filter, S3 throttling, etc. 
   
   One way I would try testing this is to remove Hudi from the picture, take 
the same parquet dataset, and run the query with and without the parquet native 
bloom filter enabled. You should see the number of output rows reduced, but the 
query time may not improve much, because each file still has to be opened to 
read its bloom filter. 
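
   A toy sketch of why that experiment can show fewer rows but similar latency. 
This is a plain-Python cost model, not Parquet or Spark code: `ToyFile`, 
`run_query`, and the counters are hypothetical names, and a Python `set` stands 
in for the bloom filter (so it has no false positives, unlike a real one).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyFile:
    """Hypothetical stand-in for one parquet file holding a single column."""
    rows: List[int]

    def key_filter(self) -> set:
        # Stand-in for the bloom filter stored inside the file itself.
        return set(self.rows)


def run_query(files: List[ToyFile], wanted: int, use_bloom: bool) -> dict:
    """Count the work done by an equality lookup on the toy files."""
    opened = scanned = matches = 0
    for f in files:
        # The filter lives in the file, so the file is opened either way.
        opened += 1
        if use_bloom and wanted not in f.key_filter():
            continue  # rows are skipped, but the open already happened
        scanned += len(f.rows)
        matches += sum(1 for v in f.rows if v == wanted)
    return {"files_opened": opened, "rows_scanned": scanned, "matches": matches}


# Ten files of 100 consecutive values each.
files = [ToyFile(rows=list(range(i * 100, (i + 1) * 100))) for i in range(10)]
without_bf = run_query(files, 250, use_bloom=False)
with_bf = run_query(files, 250, use_bloom=True)
print(without_bf)
print(with_bf)
```

In this model the filter cuts the rows scanned, while the number of files 
opened stays the same, which is the part that tends to dominate on S3.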
   
   The column stats in Hudi's metadata table help reduce the number of files 
scanned (unlike parquet native bloom filters). With data skipping enabled, Hudi 
uses the column stats stored in the metadata table instead of scanning the 
metadata in each parquet file, so it can plan the query against those stats and 
the predicates and read fewer files when possible (see this 
[blog](https://www.onehouse.ai/blog/hudis-column-stats-index-and-data-skipping-feature-help-speed-up-queries-by-an-orders-of-magnitude)
 for more details on data skipping in Hudi). This is particularly helpful on 
cloud storage, where requests have a constant overhead and are subject to rate 
limiting. 
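
   A minimal sketch of the idea behind planning from a central stats index, 
assuming simple per-file min/max values for one column. It is a toy model in 
plain Python, not Hudi's implementation: `ColumnStats`, `plan_files`, and the 
file names are hypothetical.

```python
from typing import Dict, List, NamedTuple


class ColumnStats(NamedTuple):
    """Hypothetical min/max statistics for one column of one file."""
    min_value: int
    max_value: int


def plan_files(stats_index: Dict[str, ColumnStats], wanted: int) -> List[str]:
    """Pick the files whose [min, max] range can contain the wanted value.

    Only the central index is consulted; no data file is opened here.
    """
    return [
        name
        for name, stats in stats_index.items()
        if stats.min_value <= wanted <= stats.max_value
    ]


# Central index covering ten files of 100 consecutive values each.
stats_index = {
    f"file_{i}.parquet": ColumnStats(i * 100, i * 100 + 99) for i in range(10)
}
to_read = plan_files(stats_index, 250)
print(to_read)
```

Here the pruning happens before any file request is issued, so only the 
surviving files cost a storage call. Min/max pruning like this works best when 
values are clustered by file; with overlapping ranges fewer files are skipped.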
   
   You bring valid feedback that we will take and work on: better showcasing 
the impact of these indexes so that users can easily spot it. We will update 
you shortly on how we are incorporating this.
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
