Hello, everyone! Recently a regression was discovered in the Hudi 0.12 release, related to the Bloom Index metadata persisted within Parquet footers (HUDI-4992 <https://issues.apache.org/jira/browse/HUDI-4992>).
The crux of the problem was that min/max statistics for the record keys were computed incorrectly during the (Spark-specific) row-writing <https://hudi.apache.org/docs/next/configurations#hoodiedatasourcewriterowwriterenable> Bulk Insert operation, affecting the Key Range Pruning flow <https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges> within the Hoodie Bloom Index <https://hudi.apache.org/docs/next/faq/#how-do-i-configure-bloom-filter-when-bloomglobal_bloom-index-is-used> tagging sequence. As a result, updated records were incorrectly tagged as "inserts" rather than "updates", leading to duplicated records in the table. A PR <https://github.com/apache/hudi/pull/6883> addressing the problem has already landed on master and *is also going to be incorporated into the upcoming Hudi 0.12.1 release.*

If all of the following applies to you:

1. Using Spark as an execution engine
2. Using Bulk Insert (with row-writing <https://hudi.apache.org/docs/next/configurations#hoodiedatasourcewriterowwriterenable>, enabled *by default*)
3. Using Bloom Index (with range-pruning <https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges> enabled, which is the *default*) for "UPSERT" operations

please consider one of the following remediations to avoid getting duplicate records in your pipeline:

- Disabling the Bloom Index range-pruning <https://hudi.apache.org/docs/next/basic_configurations/#hoodiebloomindexprunebyranges> flow (might affect the performance of upsert operations); see the sketch at the end of this message
- Upgrading to 0.12.1 (which is targeted to be released this week)
- Making sure that the fix <https://github.com/apache/hudi/pull/6883> is included in your custom artifacts (if you're building and using your own)

Please let me know if you have any questions.
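For reference, here is a minimal Spark (Scala) sketch of the first workaround, disabling range-pruning via `hoodie.bloom.index.prune.by.ranges`. The DataFrame, record key / precombine fields, table name and base path below are placeholders; keep your existing write options and only add the range-pruning override:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Minimal sketch, not a definitive recommendation: `df`, the field names,
    // table name and base path are assumptions for illustration only.
    def upsertWithoutRangePruning(df: DataFrame, basePath: String): Unit = {
      df.write
        .format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.index.type", "BLOOM")
        // Workaround for HUDI-4992: disable key-range pruning so Bloom Index
        // tagging does not rely on the (incorrect) min/max key statistics
        .option("hoodie.bloom.index.prune.by.ranges", "false")
        .mode(SaveMode.Append)
        .save(basePath)
    }

Note that skipping range-pruning means every candidate file is checked against the Bloom filter, so upsert latency may increase on large tables; it is a temporary mitigation until you can pick up 0.12.1 or the fix itself.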