ad1happy2go commented on issue #6166:
URL: https://github.com/apache/hudi/issues/6166#issuecomment-1534596219

   Able to reproduce this. Actually, the MOR table does support upserts, but we are still seeing dupes.
   
   ```
   # Publish first 100 records
   bash setupKafka.sh -n 100 -k test1
       Before Compaction - 100
       After Compaction - 100
   
   # Publish next 100 records, all with new keys
   bash setupKafka.sh -n 100 -t -o 100 -k test1
       Before Compaction - 200
       After Compaction - 200
   
   # Publish next 100 records: 50 upserts and 50 new keys
   bash setupKafka.sh -n 100 -t -o 150 -k test1
       Before Compaction - 285 [ Distinct count of ("_hoodie_partition_path","volume") = 250 ]
       After Compaction - 285
       So, dupes start appearing after this step.
   
   
   bash setupKafka.sh -n 100 -t -o 180 -k test1
       Before Compaction - 369 [ Distinct count = 280 ]
       After Compaction - 369 [ Distinct count = 280 ]
   ```
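   The dupe check in the log above compares the total row count against the distinct count of (`"_hoodie_partition_path"`, `"volume"`). A minimal sketch of that check in plain Python (the rows here are hypothetical stand-ins for a snapshot read of the table, which would normally go through Spark):

   ```python
   # Sketch of the dupe check: total rows vs. distinct (partition_path, record_key)
   # pairs. Rows are hypothetical stand-ins for a Hudi snapshot read.

   def dupe_report(rows):
       """rows: iterable of (partition_path, volume) tuples from the table."""
       rows = list(rows)
       total = len(rows)
       distinct = len(set(rows))
       return total, distinct, total - distinct

   # Example: key "50" was upserted but appears twice -> one duplicate.
   rows = [("2023-05-01", "49"), ("2023-05-01", "50"), ("2023-05-01", "50")]
   total, distinct, dupes = dupe_report(rows)
   print(total, distinct, dupes)  # 3 2 1
   ```

   With correct upsert semantics, `total` and `distinct` should always match; the 285 vs. 250 gap above is the signal that dupes were written.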
   Connect properties used:
   ```
   {
       "name": "test1",
       "config": {
           "bootstrap.servers": "kafkabroker:9092",
           "connector.class": "org.apache.hudi.connect.HoodieSinkConnector",
           "tasks.max": "4",
           "key.converter": "org.apache.kafka.connect.storage.StringConverter",
           "value.converter": "org.apache.kafka.connect.storage.StringConverter",
           "value.converter.schemas.enable": "false",
           "topics": "test1",
           "hoodie.table.name": "test1",
           "hoodie.table.type": "MERGE_ON_READ",
           "hoodie.base.path": "file:///tmp/hoodie/test1",
           "hoodie.datasource.write.recordkey.field": "volume",
           "hoodie.datasource.write.partitionpath.field": "date",
           "hoodie.schemaprovider.class": "org.apache.hudi.schema.SchemaRegistryProvider",
           "hoodie.deltastreamer.schemaprovider.registry.url": "http://localhost:8082/subjects/test1/versions/latest",
           "hoodie.kafka.commit.interval.secs": 60,
           "hoodie.compact.schedule.inline": "true",
           "hoodie.compact.inline.max.delta.commits": 1
       }
   }
   ```
   
   Also, I noticed data loss when async clustering is enabled.
    
   

