mzheng-plaid opened a new issue, #11044:
URL: https://github.com/apache/hudi/issues/11044
**Describe the problem you faced**
`hoodie.combine.before.insert` works with `bulk_insert` when the meta fields
are enabled, but silently does nothing and causes duplicates when they are
disabled (i.e., `"hoodie.populate.meta.fields": "false"`).
**To Reproduce**
I provided a trivial reproduction below (`hoodie.populate.meta.fields` seems
to be the only option that affects whether the bug occurs):
```python
# Generate dummy data
from pyspark.sql import Row
input_data = [
    Row(id=4, value="foo", ts=0),
    Row(id=4, value="bar", ts=1),
]
df = spark.createDataFrame(input_data)
# Example Hudi configs
hudi_options = {
"hoodie.table.name": "fake_name",
"hoodie.datasource.write.table.name": "fake_name",
"hoodie.datasource.write.table.type": "COPY_ON_WRITE",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.metadata.enable": "false",
"hoodie.bootstrap.index.enable": "false",
"hoodie.datasource.write.partitionpath.field": "",
"hoodie.datasource.write.recordkey.field": "id",
"hoodie.datasource.write.precombine.field": "ts",
# Testing out bulk insert
"hoodie.combine.before.insert": "true",
"hoodie.datasource.write.operation": "bulk_insert",
}
PATH = "hdfs:///example_github"
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(PATH)
# Should be 1 but prints 2
print(spark.read.format("hudi").load(PATH).count())
# Both rows exist
print(spark.read.format("hudi").load(PATH).collect())
```
**Expected behavior**
The following is surprising:
- `bulk_insert` deduplicates properly if `hoodie.populate.meta.fields` is
enabled
- `bulk_insert` does not deduplicate if `hoodie.populate.meta.fields` is
disabled
- `insert` deduplicates properly regardless of `hoodie.populate.meta.fields`
I think users expect `bulk_insert` to behave consistently (and as documented)
regardless of `hoodie.populate.meta.fields`. Ideally, `hoodie.combine.before.insert`
would deduplicate in both cases.
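For clarity, here is a pure-Python sketch (not Hudi code, just an illustration of the expected semantics) of what `hoodie.combine.before.insert` should do: group rows by the record key and keep the row with the largest precombine field.

```python
# Illustration only: the deduplication semantics we expect from
# hoodie.combine.before.insert, expressed over plain dicts.
def combine_before_insert(rows, key="id", precombine="ts"):
    """Keep, per record key, the row with the largest precombine value."""
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[precombine] > best[k][precombine]:
            best[k] = row
    return list(best.values())

rows = [
    {"id": 4, "value": "foo", "ts": 0},
    {"id": 4, "value": "bar", "ts": 1},
]
deduped = combine_before_insert(rows)
# Only the ts=1 record should survive.
```

Under these semantics the repro above should write exactly one row, regardless of whether meta fields are populated.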
**Environment Description**
This runs on
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6101-release.html
* Hudi version : 0.12.2
* Spark version : 3.3.1
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : HDFS/S3 both seem affected
* Running on Docker? (yes/no) : Yes
**Additional context**
We would love to update to a new version of Hudi but there are serious
blocking bugs with key generators that are still open:
- https://github.com/apache/hudi/issues/10508 ("The ComplexKeyGenerator does
not produce the same result for 0.14.1 than previous versions.")
- https://github.com/apache/hudi/issues/8372 (CustomKeyGenerator does not
work with delete partitions operation)
Is there a way to workaround this on our current version of Hudi?