mzheng-plaid opened a new issue, #11044:
URL: https://github.com/apache/hudi/issues/11044
**Describe the problem you faced**
`hoodie.combine.before.insert` works with `bulk_insert` when the meta fields
are enabled, but silently does nothing and causes duplicates when they are
disabled (i.e., `"hoodie.populate.meta.fields": "false"`).
**To Reproduce**
I provided a trivial reproduction below (`hoodie.populate.meta.fields` seems
to be the only option that affects whether the bug occurs):
```python
# Generate dummy data
from pyspark.sql import Row
input_data = [
    Row(id=4, value="foo", ts=0),
    Row(id=4, value="bar", ts=1),
]
df = spark.createDataFrame(input_data)
# Example Hudi configs
hudi_options = {
"hoodie.table.name": "fake_name",
"hoodie.datasource.write.table.name": "fake_name",
"hoodie.datasource.write.table.type": "COPY_ON_WRITE",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.metadata.enable": "false",
"hoodie.bootstrap.index.enable": "false",
"hoodie.datasource.write.partitionpath.field": "",
"hoodie.datasource.write.recordkey.field": "id",
"hoodie.datasource.write.precombine.field": "ts",
# Testing out bulk insert
"hoodie.combine.before.insert": "true",
"hoodie.datasource.write.operation": "bulk_insert",
}
PATH = "hdfs:///example_github"
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(PATH)
# Should be 1 but prints 2
print(spark.read.format("hudi").load(PATH).count())
# Both rows exist
print(spark.read.format("hudi").load(PATH).collect())
```
**Expected behavior**
The following is surprising:
- `bulk_insert` deduplicates properly if `hoodie.populate.meta.fields` is
enabled
- `bulk_insert` does not deduplicate if `hoodie.populate.meta.fields` is
disabled
- `insert` deduplicates properly regardless of `hoodie.populate.meta.fields`
I think users expect `bulk_insert` to behave consistently (and as documented)
regardless of `hoodie.populate.meta.fields`. Ideally, `hoodie.combine.before.insert`
would deduplicate in both cases.
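For clarity, here is a pure-Python sketch (not Hudi code, just an illustration of the expected semantics) of what `hoodie.combine.before.insert` should do: group rows by the record key and keep the row with the largest precombine field.

```python
# Illustration only: the deduplication semantics we expect from
# hoodie.combine.before.insert, expressed over plain dicts.
def combine_before_insert(rows, key="id", precombine="ts"):
    """Keep, per record key, the row with the largest precombine value."""
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[precombine] > best[k][precombine]:
            best[k] = row
    return list(best.values())

rows = [
    {"id": 4, "value": "foo", "ts": 0},
    {"id": 4, "value": "bar", "ts": 1},
]
deduped = combine_before_insert(rows)
# Only the ts=1 record should survive.
```

Under these semantics the repro above should write exactly one row, regardless of whether meta fields are populated.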
**Environment Description**
This runs on
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6101-release.html
* Hudi version : 0.12.2
* Spark version : 3.3.1
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : HDFS/S3 both seem affected
* Running on Docker? (yes/no) : Yes
**Additional context**
We would love to update to a new version of Hudi but there are serious
blocking bugs with key generators that are still open:
- https://github.com/apache/hudi/issues/10508 ("The ComplexKeyGenerator does
not produce the same result for 0.14.1 than previous versions.")
- https://github.com/apache/hudi/issues/8372 (CustomKeyGenerator does not
work with delete partitions operation)
Is there a way to workaround this on our current version of Hudi?