jspaine opened a new issue, #9271: URL: https://github.com/apache/hudi/issues/9271
**Describe the problem you faced**

Enabling the bloom index metadata options below on a large partitioned table causes upserted rows to be duplicated and deletions to fail to remove most records. Re-creating the table unpartitioned, or with the index metadata options disabled, resolves the issue.

```
"hoodie.bloom.index.use.metadata": "true",
"hoodie.metadata.index.bloom.filter.enable": "true",
```

**To Reproduce**

I'm not sure how easy this is to reproduce, but it was very repeatable with my dataset. The table has almost 500M rows and 104 columns, and is partitioned by year and month into ~150 partitions. The data is heavily skewed: a few hundred rows up to 2015, growing exponentially after that, with half the total rows in the current year.

In each commit, most of the rows that should have been updated were instead inserted as duplicates, and most rows that should have been deleted were not removed. Checking one of the (hundreds of thousands of) duplicated rows showed that the two copies are in separate files, and the filename of the original record didn't appear in the commit log.

I also tried again with the dev version of the table (3M rows, much slower growth, and less partition skew), applying a handful of updates/deletes, and still can't reproduce the issue there.

Steps to reproduce the behavior:

1. Load the existing data using the `insert` operation with the index metadata options enabled
2. Apply a batch of updates
3. Select rows grouped by record key having count > 1
4. See that many of the records that should have been updated have been duplicated instead

Insert options:

```python
{
    "hoodie.bloom.index.use.metadata": "true",
    "hoodie.database.name": "hudi_db",
    "hoodie.datasource.hive_sync.database": "hudi_db",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.partition_fields": "partition_year,partition_month",
    "hoodie.datasource.hive_sync.table_properties": "hudi.metadata-listing-enabled=TRUE",
    "hoodie.datasource.hive_sync.table": "table",
    "hoodie.datasource.meta_sync.condition.sync": "true",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.partitionpath.field": "partition_year,partition_month",
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.datasource.write.precombine.field": "_timestamp",
    "hoodie.datasource.write.reconcile.schema": "true",
    "hoodie.datasource.write.recordkey.field": "_id",
    "hoodie.enable.data.skipping": "true",
    "hoodie.index.type": "BLOOM",
    "hoodie.insert.shuffle.parallelism": "300",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.parquet.compression.codec": "SNAPPY",
    "hoodie.parquet.compression.ratio": "0.75",
    "hoodie.payload.ordering.field": "_timestamp",
    "hoodie.schema.on.read.enable": "true",
    "hoodie.table.name": "table",
    "path": f"s3://{config.output_bucket}/{TABLE_NAME}",
}
```

Upsert options:

```python
{
    "hoodie.bloom.index.use.metadata": "true",
    "hoodie.database.name": "hudi_db",
    "hoodie.datasource.hive_sync.database": "hudi_db",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.partition_fields": "partition_year,partition_month",
    "hoodie.datasource.hive_sync.table_properties": "hudi.metadata-listing-enabled=TRUE",
    "hoodie.datasource.hive_sync.table": "table",
    "hoodie.datasource.meta_sync.condition.sync": "true",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.partitionpath.field": "partition_year,partition_month",
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.datasource.write.precombine.field": "_timestamp",
    "hoodie.datasource.write.reconcile.schema": "true",
    "hoodie.datasource.write.recordkey.field": "_id",
    "hoodie.enable.data.skipping": "true",
    "hoodie.index.type": "BLOOM",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.metrics.cloudwatch.metric.prefix": "hudi_db.table",
    "hoodie.metrics.on": "true" if config.enable_monitoring else "false",
    "hoodie.metrics.reporter.type": "CLOUDWATCH",
    "hoodie.parquet.compression.codec": "SNAPPY",
    "hoodie.parquet.compression.ratio": "0.75",
    "hoodie.payload.ordering.field": "_timestamp",
    "hoodie.schema.on.read.enable": "true",
    "hoodie.table.name": "table",
    "hoodie.upsert.shuffle.parallelism": "100",
    "path": f"s3://{config.output_bucket}/{TABLE_NAME}",
}
```

**Expected behavior**

No duplicates are produced when updating records.

**Environment Description**

* Hudi version : 0.13.0 (EMR Serverless 6.11.0; also tried Hudi 0.12.2 on EMR Serverless 6.10.0)
* Spark version : 3.3.2
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes
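For context, option dicts like the ones above are applied through the standard Spark datasource write path. A minimal sketch, assuming an active `SparkSession` with the Hudi bundle on the classpath; `df`, `insert_options`, and `upsert_options` are placeholders for the job's actual dataframe and the dicts shown above:

```python
# Hedged sketch of the standard Hudi datasource write pattern;
# `df` and the option dicts are placeholders, not the reporter's job code.
(
    df.write.format("hudi")
    .options(**upsert_options)  # dict from "Upsert options" above
    .mode("append")             # upserts append new commits to the timeline
    .save(f"s3://{config.output_bucket}/{TABLE_NAME}")
)
```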
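The duplicate check from step 3 of the reproduction is a plain group-by-record-key count. A minimal stand-in in plain Python to illustrate the shape of the check (in practice it would run as Spark SQL over the synced table; the `_id` record key field comes from the options above, and the sample rows are made up):

```python
from collections import Counter

# Toy stand-in for rows read back from the table after the upsert.
# The real check is roughly:
#   SELECT _id, COUNT(*) FROM hudi_db.table GROUP BY _id HAVING COUNT(*) > 1
rows = [
    {"_id": "a", "_timestamp": 1},
    {"_id": "b", "_timestamp": 1},
    {"_id": "a", "_timestamp": 2},  # upserted copy that landed in a new file
]

counts = Counter(row["_id"] for row in rows)
duplicated_keys = sorted(k for k, n in counts.items() if n > 1)
print(duplicated_keys)  # a correct upsert should leave this list empty
```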
