jspaine opened a new issue, #9271:
URL: https://github.com/apache/hudi/issues/9271

   **Describe the problem you faced**
   
   Enabling the bloom index metadata options below on a large partitioned table 
causes upserted rows to be duplicated and deletions to fail to remove most 
records. Re-creating the table unpartitioned, or with the index metadata 
options disabled, resolves the issue.
   
   ```
   "hoodie.bloom.index.use.metadata": "true",
   "hoodie.metadata.index.bloom.filter.enable": "true",
   ```
   
   **To Reproduce**
   
   I'm not sure how easy it is to reproduce, but it was very repeatable with my 
dataset:
   
   The table has almost 500M rows and 104 columns, partitioned by year and month 
into ~150 partitions. The data is heavily skewed: there are a few hundred rows 
up to 2015, after which the row count grows roughly exponentially, with half of 
the total rows in the current year.
   
   In each commit, it seemed that most of the rows that should have been updated 
were instead inserted as duplicates, and most of the rows that should have been 
deleted were left in place. Inspecting one of the (hundreds of thousands of) 
duplicated rows showed that the two copies live in separate files, and the 
filename of the original record did not appear in the commit log:
   
   
![20230720_09h00m15s_grim](https://github.com/apache/hudi/assets/6288863/f89983e3-733b-4a96-bca1-f679de4357d5)
   
   
   I just tried again with the dev version of the table (3M rows, much slower 
growth, and less partition skew), applying a handful of updates/deletes, and 
still can't reproduce the issue there.
   
   Steps to reproduce the behavior:
   
   1. Load existing data using insert with index metadata options enabled
   2. Apply a batch of updates
   3. Select rows grouped by record key having count > 1
   4. See that many of the records that should have been updated have been 
duplicated
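
   The duplicate check in step 3 boils down to counting rows per record key and 
flagging keys that appear more than once. As a minimal local illustration (plain 
Python with made-up sample keys, not the actual Spark job):

   ```python
   from collections import Counter

   # Hypothetical sample of record keys read back after the upsert batch.
   # In the real table the record key is the "_id" field configured below.
   record_keys = ["a1", "a2", "a3", "a1", "a4", "a1"]

   counts = Counter(record_keys)
   duplicates = {key: n for key, n in counts.items() if n > 1}
   print(duplicates)  # -> {'a1': 3}
   ```

   In Spark SQL the equivalent is `SELECT _id, COUNT(*) FROM table GROUP BY _id 
HAVING COUNT(*) > 1`.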
   
   Insert options:
   ```python
   {
       "hoodie.bloom.index.use.metadata": "true",
       "hoodie.database.name": "hudi_db",
       "hoodie.datasource.hive_sync.database": "hudi_db",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.mode": "hms",
       "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
       "hoodie.datasource.hive_sync.partition_fields": "partition_year,partition_month",
       "hoodie.datasource.hive_sync.table_properties": "hudi.metadata-listing-enabled=TRUE",
       "hoodie.datasource.hive_sync.table": "table",
       "hoodie.datasource.meta_sync.condition.sync": "true",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.operation": "insert",
       "hoodie.datasource.write.partitionpath.field": "partition_year,partition_month",
       "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
       "hoodie.datasource.write.precombine.field": "_timestamp",
       "hoodie.datasource.write.reconcile.schema": "true",
       "hoodie.datasource.write.recordkey.field": "_id",
       "hoodie.enable.data.skipping": "true",
       "hoodie.index.type": "BLOOM",
       "hoodie.insert.shuffle.parallelism": "300",
       "hoodie.metadata.index.bloom.filter.enable": "true",
       "hoodie.metadata.index.column.stats.enable": "true",
       "hoodie.parquet.compression.codec": "SNAPPY",
       "hoodie.parquet.compression.ratio": "0.75",
       "hoodie.payload.ordering.field": "_timestamp",
       "hoodie.schema.on.read.enable": "true",
       "hoodie.table.name": "table",
       "path": f"s3://{config.output_bucket}/{TABLE_NAME}",
   }
   ```
   
   Upsert options:
   ```python
   {
       "hoodie.bloom.index.use.metadata": "true",
       "hoodie.database.name": "hudi_db",
       "hoodie.datasource.hive_sync.database": "hudi_db",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.mode": "hms",
       "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
       "hoodie.datasource.hive_sync.partition_fields": "partition_year,partition_month",
       "hoodie.datasource.hive_sync.table_properties": "hudi.metadata-listing-enabled=TRUE",
       "hoodie.datasource.hive_sync.table": "table",
       "hoodie.datasource.meta_sync.condition.sync": "true",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.partitionpath.field": "partition_year,partition_month",
       "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
       "hoodie.datasource.write.precombine.field": "_timestamp",
       "hoodie.datasource.write.reconcile.schema": "true",
       "hoodie.datasource.write.recordkey.field": "_id",
       "hoodie.enable.data.skipping": "true",
       "hoodie.index.type": "BLOOM",
       "hoodie.metadata.index.bloom.filter.enable": "true",
       "hoodie.metadata.index.column.stats.enable": "true",
       "hoodie.metrics.cloudwatch.metric.prefix": "hudi_db.table",
       "hoodie.metrics.on": "true" if config.enable_monitoring else "false",
       "hoodie.metrics.reporter.type": "CLOUDWATCH",
       "hoodie.parquet.compression.codec": "SNAPPY",
       "hoodie.parquet.compression.ratio": "0.75",
       "hoodie.payload.ordering.field": "_timestamp",
       "hoodie.schema.on.read.enable": "true",
       "hoodie.table.name": "table",
       "hoodie.upsert.shuffle.parallelism": "100",
       "path": f"s3://{config.output_bucket}/{TABLE_NAME}",
   }
   ```
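
   For easier comparison: aside from the CloudWatch metrics keys, the two dicts 
differ only in the operation and shuffle-parallelism settings. The insert run 
sets `hoodie.datasource.write.operation` explicitly, while the upsert run omits 
it and relies on Hudi's default `upsert` operation. Summarizing just the 
differing keys (everything else is identical between the two runs):

   ```python
   # Keys that appear in only one of the two option dicts above
   # (every other key is identical between the insert and upsert runs).
   insert_only_keys = {
       # The upsert run omits this key; Hudi then defaults to "upsert".
       "hoodie.datasource.write.operation": "insert",
       "hoodie.insert.shuffle.parallelism": "300",
   }
   upsert_only_keys = {
       "hoodie.metrics.cloudwatch.metric.prefix": "hudi_db.table",
       "hoodie.metrics.on": "true",  # when monitoring is enabled
       "hoodie.metrics.reporter.type": "CLOUDWATCH",
       "hoodie.upsert.shuffle.parallelism": "100",
   }

   # Sanity check: the two sets of extra keys don't overlap.
   assert not set(insert_only_keys) & set(upsert_only_keys)
   ```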
   
   **Expected behavior**
   
   Updates should not produce duplicate records, and deletes should remove the 
targeted records.
   
   **Environment Description**
   
   * Hudi version : 0.13.0 (EMR Serverless 6.11.0; also tried Hudi 0.12.2 on EMR Serverless 6.10.0)
   
   * Spark version : 3.3.2
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   

