alexeykudinkin commented on issue #5808:
URL: https://github.com/apache/hudi/issues/5808#issuecomment-1154216217

   No worries! We are actually looking to support the Bloom-filter index in 
Data Skipping eventually as well, but this will definitely be a non-trivial 
challenge given the sheer difference in size between the Bloom-filter index and 
the Column Stats Index, even for moderately sized tables.
   
   > think customer uuid column for example - since it is a random string, 
column stats would be relatively useless - but bloom filter could skip 99% of 
all files when looking for a particular uuid.
   Or am I missing how the column stats work - reading the code/metadata - 
they seem useful for monotonic or slowly changing columns - like dates or db 
FKs - where min/max stats in combination with clustering/sorting can do proper 
data skipping.
   
   You're right -- Data Skipping effectiveness is correlated with how disjoint 
the individual files' value ranges are for a particular column. The opposite is 
also true -- if the ranges for column A are exactly the same in every file, Data 
Skipping effectiveness will be practically 0 (we often call this "pruning 
potential"). As you rightly noticed, it is most effective with ordered or 
semi-ordered columns, and therefore we usually recommend that folks consider 
clustering on the columns they query most often to leverage Data Skipping's 
full potential (especially given that since 0.10 Hudi has supported 
space-filling curves such as Z-order and Hilbert in its clustering suite).
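
   To make the "pruning potential" point concrete, here is a toy sketch of 
min/max-based file pruning. It is not Hudi's implementation: `FileStats`, 
`prune`, and the sample file ranges are hypothetical names and data made up 
purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file min/max stats for a single column."""
    name: str
    min_val: str
    max_val: str


def prune(files, value):
    """Keep only the files whose [min, max] range could contain `value`."""
    return [f for f in files if f.min_val <= value <= f.max_val]


# Clustered/sorted on a date column: per-file ranges are disjoint.
clustered = [
    FileStats("f1", "2022-01-01", "2022-01-31"),
    FileStats("f2", "2022-02-01", "2022-02-28"),
    FileStats("f3", "2022-03-01", "2022-03-31"),
]

# Unordered data: every file spans the same full range.
overlapping = [
    FileStats("f1", "2022-01-01", "2022-03-31"),
    FileStats("f2", "2022-01-01", "2022-03-31"),
    FileStats("f3", "2022-01-01", "2022-03-31"),
]

lookup = "2022-02-14"
kept_clustered = prune(clustered, lookup)      # only f2 can match
kept_overlapping = prune(overlapping, lookup)  # nothing can be skipped
```

   With disjoint ranges the lookup has to read a single file; with identical 
ranges every file is a candidate, which is the situation a random-UUID column 
typically ends up in.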

