alexeykudinkin commented on issue #5808: URL: https://github.com/apache/hudi/issues/5808#issuecomment-1154216217
No worries! We're actually looking to support the Bloom-filter index in Data Skipping eventually as well, but this will definitely be a non-trivial challenge given the sheer difference in size between the Bloom-filter index and the Column Stats Index, even for moderately sized tables.

> think customer uuid column for example - since it is a random string, column stats would be relatively useless - but bloom filter could skip 99% of all files when looking for a particular uuid. Or am I missing on how the column stats work - reading the code/metadata - they seem useful for monotonic or slowly changing columns - like dates or db FK's - where min/max stats in combination of clustering/sorting can do proper data skipping.

You're right -- Data Skipping effectiveness (we often call it "pruning potential") depends on how disjoint the individual files' value ranges are for a particular column. The opposite is also true -- if the ranges for column A are exactly the same in every file, Data Skipping effectiveness will be practically 0. As you rightly noticed, it is most effective with ordered or semi-ordered columns, and therefore we usually recommend that folks cluster on the columns they query most often to leverage Data Skipping's full potential (especially given that since 0.10 Hudi supports space-filling curves such as Z-order and Hilbert in its clustering suite).

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
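To make the "pruning potential" point in the comment concrete, here is a minimal sketch of min/max-based file pruning. It is an illustration only, not Hudi's implementation: `FileStats`, `files_to_scan`, and the two sample layouts are hypothetical names and data invented for this example, and integer keys stand in for real column values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file column statistics (not Hudi's actual schema)."""
    name: str
    min_value: int
    max_value: int


def files_to_scan(stats, value):
    """Keep only files whose [min, max] range could contain `value`.

    A file is skipped when the lookup value falls outside its range.
    """
    return [s.name for s in stats if s.min_value <= value <= s.max_value]


# Clustered (sorted) layout: each file covers a disjoint range of 100 keys.
clustered = [FileStats(f"file-{i}", i * 100, i * 100 + 99) for i in range(10)]

# Scattered layout (think of a random, UUID-like column): every file spans
# almost the whole key space, so the ranges overlap nearly completely.
scattered = [FileStats(f"file-{i}", i, 990 + i) for i in range(10)]

lookup_key = 542
clustered_hits = files_to_scan(clustered, lookup_key)
scattered_hits = files_to_scan(scattered, lookup_key)

print(f"clustered: scan {len(clustered_hits)} of {len(clustered)} files")
print(f"scattered: scan {len(scattered_hits)} of {len(scattered)} files")
```

With disjoint ranges the lookup touches a single file, while with fully overlapping ranges no file can be skipped. That is the case where min/max statistics stop helping and a Bloom filter, or clustering on the queried column, becomes the more useful tool.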
