Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

via GitHub Wed, 31 Jan 2024 14:31:55 -0800


bhasudha commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1920095297

Hi @bk-mz. Wanted to add to this thread. As explained in the comments above,
query latency may not be the only metric to measure. Runs with parquet native
bloom filters enabled can still take a similar amount of time, and that time
could be dominated by a few factors: the need to still open every file to load
its parquet native bloom filter, S3 throttling, etc.

One way I would test this is to remove Hudi from the picture: take the same
parquet dataset and run the query with and without the parquet native bloom
filter enabled. You should see the number of output rows reduced, but the
query time may not improve much, because each file still has to be opened to
read its bloom filter.
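
A minimal sketch of that experiment in PySpark, assuming a running
SparkSession and a DataFrame already loaded from the dataset. The helper
names, paths, and the `expected_ndv` value are hypothetical; the
`parquet.bloom.filter.*#<column>` keys are the per-column writer options
described in the Spark Parquet data source documentation.

```python
import time


def bloom_write_options(column, expected_ndv):
    """Parquet writer options that request a bloom filter for one column."""
    return {
        f"parquet.bloom.filter.enabled#{column}": "true",
        f"parquet.bloom.filter.expected.ndv#{column}": str(expected_ndv),
    }


def write_both_copies(df, base_path, column, expected_ndv):
    """Write the same DataFrame twice: without and with a bloom filter."""
    plain_path = f"{base_path}/plain"   # hypothetical layout
    bloom_path = f"{base_path}/bloom"   # hypothetical layout
    df.write.mode("overwrite").parquet(plain_path)
    (
        df.write.mode("overwrite")
        .options(**bloom_write_options(column, expected_ndv))
        .parquet(bloom_path)
    )
    return plain_path, bloom_path


def time_point_lookup(spark, path, column, value):
    """Run an equality lookup; return (matching_row_count, elapsed_seconds)."""
    # Imported here so the option helper above works without PySpark installed.
    from pyspark.sql.functions import col

    start = time.perf_counter()
    count = spark.read.parquet(path).where(col(column) == value).count()
    return count, time.perf_counter() - start
```

Running `time_point_lookup` against both paths gives the latency side of the
comparison; the rows-read side is probably easiest to check on the scan node
in the Spark UI's SQL tab rather than from the returned count, which is the
same for both copies.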

The column stats in Hudi's metadata table help reduce the number of files
scanned (unlike parquet native bloom filters). With data skipping enabled,
Hudi uses the column stats stored in the metadata table instead of scanning
the metadata in each parquet file, so it can plan the query from those stats
and the predicates and read fewer files when possible (see this
[blog](https://www.onehouse.ai/blog/hudis-column-stats-index-and-data-skipping-feature-help-speed-up-queries-by-an-orders-of-magnitude)
for more details on data skipping in Hudi). This is particularly helpful on
cloud storage, where each request carries a fixed overhead and is subject to
rate limiting.
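
For reference, a hedged sketch of the options involved, written as plain
Python dicts that can be passed to a Spark writer or reader. The config keys
are Hudi's metadata table, column stats index, and data skipping settings;
the helper names and the `table_path` argument are hypothetical, and defaults
may differ between Hudi versions.

```python
def column_stats_write_options(columns=None):
    """Writer options that build the column stats index in the metadata table."""
    options = {
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.index.column.stats.enable": "true",
    }
    if columns:
        # Optionally restrict the index to the columns used in predicates.
        options["hoodie.metadata.index.column.stats.column.list"] = ",".join(columns)
    return options


# Reader options that let the query planner prune files using column stats.
DATA_SKIPPING_READ_OPTIONS = {
    "hoodie.metadata.enable": "true",
    "hoodie.enable.data.skipping": "true",
}


def read_with_data_skipping(spark, table_path):
    """Load a Hudi table with data skipping turned on (table_path is assumed)."""
    return (
        spark.read.format("hudi")
        .options(**DATA_SKIPPING_READ_OPTIONS)
        .load(table_path)
    )
```

The write-side options only take effect for commits made after they are set,
so a table written earlier without the index would need the index built
before the read-side pruning has any stats to work with.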

You bring valid feedback that we will take and work on: better showcasing the
impact of these indexes so that users can easily see it. Will update you
shortly on how we are incorporating this.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
