ad1happy2go commented on issue #6166:
URL: https://github.com/apache/hudi/issues/6166#issuecomment-1534596219
Able to reproduce this. MOR tables do support upserts, but we are
seeing dupes.
```
# Publish first 100 records
bash setupKafka.sh -n 100 -k test1
Before Compaction - 100
After Compaction - 100
# Publish next 100 records all with new keys
bash setupKafka.sh -n 100 -t -o 100 -k test1
Before Compaction - 200
After Compaction - 200
# Publish next 100 records: 50 upserts and 50 new keys
bash setupKafka.sh -n 100 -t -o 150 -k test1
Before Compaction - 285 [ distinct count of
("_hoodie_partition_path","volume") = 250 ]
After Compaction - 285
# So, dupes are appearing starting from this step
bash setupKafka.sh -n 100 -t -o 180 -k test1
Before Compaction - 369 [ distinct count = 280 ]
After Compaction - 369 [ distinct count = 280 ]
```
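For reference, the duplicate check above compares the total row count with the distinct count over the key pair (`_hoodie_partition_path`, `volume`). A minimal sketch of that check on in-memory sample rows (in practice this would run as a query against the table; the row dicts here are hypothetical):

```python
# Hypothetical sketch: detect dupes by comparing total vs. distinct counts
# over the Hudi key columns (_hoodie_partition_path + record key "volume").

def duplicate_check(rows):
    """Return (total, distinct) counts over the key tuple."""
    keys = [(r["_hoodie_partition_path"], r["volume"]) for r in rows]
    return len(keys), len(set(keys))

rows = [
    {"_hoodie_partition_path": "2023-05-01", "volume": 1},
    {"_hoodie_partition_path": "2023-05-01", "volume": 2},
    {"_hoodie_partition_path": "2023-05-01", "volume": 2},  # duplicate key
]

total, distinct = duplicate_check(rows)
print(total, distinct)  # 3 2 -> duplicates present whenever total > distinct
```

A healthy upsert path should always report total == distinct after each commit.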
Kafka Connect properties used:
```
{
  "name": "test1",
  "config": {
    "bootstrap.servers": "kafkabroker:9092",
    "connector.class": "org.apache.hudi.connect.HoodieSinkConnector",
    "tasks.max": "4",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter.schemas.enable": "false",
    "topics": "test1",
    "hoodie.table.name": "test1",
    "hoodie.table.type": "MERGE_ON_READ",
    "hoodie.base.path": "file:///tmp/hoodie/test1",
    "hoodie.datasource.write.recordkey.field": "volume",
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.schemaprovider.class": "org.apache.hudi.schema.SchemaRegistryProvider",
    "hoodie.deltastreamer.schemaprovider.registry.url": "http://localhost:8082/subjects/test1/versions/latest",
    "hoodie.kafka.commit.interval.secs": 60,
    "hoodie.compact.schedule.inline": "true",
    "hoodie.compact.inline.max.delta.commits": 1
  }
}
```
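For context, with `hoodie.datasource.write.recordkey.field = volume` and `partitionpath.field = date`, an upsert should keep only the latest record per (`date`, `volume`) pair. A minimal sketch of that expected last-write-wins semantics (plain Python illustration, not Hudi's actual merge implementation):

```python
# Hypothetical sketch of upsert semantics: a later write for the same
# (partition path, record key) pair replaces the earlier one, so the table
# should never accumulate duplicates for a key.

def upsert(table, records):
    """Merge records into table keyed by (date, volume); last write wins."""
    for rec in records:
        table[(rec["date"], rec["volume"])] = rec
    return table

table = {}
upsert(table, [{"date": "2023-05-01", "volume": 100, "payload": "v1"}])
upsert(table, [{"date": "2023-05-01", "volume": 100, "payload": "v2"}])  # same key
print(len(table))  # 1 -> the second write replaced the first
```

The reproduction above shows the opposite: after the third publish, total count (285) exceeds the distinct key count (250), i.e. the upserted keys were written as new rows instead of replacing the old ones.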
Also, I noticed data loss with async clustering enabled.