Hello, everyone! Recently a regression was discovered in the Hudi 0.12 release, related to the Bloom Index metadata persisted within Parquet footers (HUDI-4992 <https://issues.apache.org/jira/browse/HUDI-4992>).
The crux of the problem was that min/max statistics for the record keys were computed incorrectly during the (Spark-specific) row-writing <https://hudi.apache.org/docs/next/configurations#hoodiedatasourcewriterowwriterenable> Bulk Insert operation, affecting the Key Range Pruning flow <https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges> within the Hoodie Bloom Index <https://hudi.apache.org/docs/next/faq/#how-do-i-configure-bloom-filter-when-bloomglobal_bloom-index-is-used> tagging sequence. As a result, updated records were incorrectly tagged as "inserts" rather than "updates", leading to duplicated records in the table. A PR <https://github.com/apache/hudi/pull/6883> addressing the problem has already landed on master and *is also going to be incorporated into the upcoming Hudi 0.12.1 release.*

If all of the following applies to you:

1. Using Spark as an execution engine
2. Using Bulk Insert (with row-writing <https://hudi.apache.org/docs/next/configurations#hoodiedatasourcewriterowwriterenable>, enabled *by default*)
3. Using Bloom Index (with range-pruning <https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges> enabled, which is the *default*) for "UPSERT" operations

please consider one of the following remediations to avoid getting duplicate records in your pipeline:

- Disabling the Bloom Index range-pruning <https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges> flow (might affect the performance of upsert operations); see the sketch at the end of this message
- Upgrading to 0.12.1 (which is targeted to be released this week)
- Making sure that the fix <https://github.com/apache/hudi/pull/6883> is included in your custom artifacts (if you're building and using your own)

Please let me know if you have any questions.
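For reference, here is a minimal Spark (Scala) sketch of the first workaround, disabling range-pruning via `hoodie.bloom.index.prune.by.ranges`. The DataFrame, record key / precombine fields, table name and base path below are placeholders; keep your existing write options and only add the range-pruning override:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Minimal sketch, not a definitive recommendation: `df`, the field names,
    // table name and base path are assumptions for illustration only.
    def upsertWithoutRangePruning(df: DataFrame, basePath: String): Unit = {
      df.write
        .format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.index.type", "BLOOM")
        // Workaround for HUDI-4992: disable key-range pruning so Bloom Index
        // tagging does not rely on the (incorrect) min/max key statistics
        .option("hoodie.bloom.index.prune.by.ranges", "false")
        .mode(SaveMode.Append)
        .save(basePath)
    }

Note that skipping range-pruning means every candidate file is checked against the Bloom filter, so upsert latency may increase on large tables; it is a temporary mitigation until you can pick up 0.12.1 or the fix itself.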