whight commented on issue #8451:
URL: https://github.com/apache/hudi/issues/8451#issuecomment-1511137106

   @ad1happy2go 
   I found that the setting's default value is false, but there are still a number of duplicate rows in the table. Is this a bug?
   
   I viewed the FAQ page; the section "How does Hudi handle duplicate record keys in an input" reads as follows:
   
   > For an insert or bulk_insert operation, no such pre-combining is performed. Thus, if your input contains duplicates, the dataset would also contain duplicates. If you don't want duplicate records either issue an upsert or consider specifying option to de-duplicate input in either [datasource](https://hudi.apache.org/docs/configurations.html#INSERT_DROP_DUPS_OPT_KEY) or [deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L229).
   
   I suggest adding a tip about this setting to that part of the FAQ; a sketch of the datasource option is below.
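   Not from the FAQ itself, but a minimal sketch of the datasource option it links to, written against the Spark datasource writer; the table name, record key / partition path / precombine fields, and output path are placeholder assumptions:

   ```scala
   // A sketch, not verified against this issue: drop duplicate record keys
   // within the incoming batch during a plain "insert", via the datasource
   // option linked above (INSERT_DROP_DUPS_OPT_KEY).
   import org.apache.spark.sql.{SaveMode, SparkSession}

   val spark = SparkSession.builder()
     .appName("hudi-insert-drop-dups-sketch")
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .getOrCreate()
   import spark.implicits._

   // Toy input with a duplicated record key (id = 1).
   val df = Seq((1, "a", 100L), (1, "a", 100L), (2, "b", 100L))
     .toDF("id", "name", "ts")

   df.write.format("hudi")
     .option("hoodie.table.name", "dedup_demo")                 // placeholder
     .option("hoodie.datasource.write.operation", "insert")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.partitionpath.field", "name")
     .option("hoodie.datasource.write.precombine.field", "ts")
     // de-duplicate the input batch before writing
     .option("hoodie.datasource.write.insert.drop.duplicates", "true")
     .mode(SaveMode.Append)
     .save("/tmp/hudi/dedup_demo")                              // placeholder
   ```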
   
   
   > @whight can you try setting hoodie.merge.allow.duplicate.on.inserts as true. I was able to reproduce your error and it got fixed with above setting.
   > 
   > Else you can also use Bulk insert which is fast but will not do small file handling. You can schedule separate clustering job for the same.
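   To make the two quoted suggestions concrete, here is a hedged sketch reusing `df` and the imports from the sketch above; the option keys are the Hudi configs named in the reply, while table name, fields, and paths are placeholders, and inline clustering is substituted for the separate clustering job the reply mentions:

   ```scala
   // Option A: keep inserts append-only during small-file handling, per the
   // quoted suggestion (duplicates in the input are then expected to remain).
   df.write.format("hudi")
     .option("hoodie.table.name", "demo")                       // placeholder
     .option("hoodie.datasource.write.operation", "insert")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.partitionpath.field", "name")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.merge.allow.duplicate.on.inserts", "true")
     .mode(SaveMode.Append)
     .save("/tmp/hudi/demo")

   // Option B: bulk_insert (fast, skips small-file handling), with inline
   // clustering every few commits to compact the small files afterwards.
   df.write.format("hudi")
     .option("hoodie.table.name", "demo")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.partitionpath.field", "name")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.clustering.inline", "true")
     .option("hoodie.clustering.inline.max.commits", "4")
     .mode(SaveMode.Append)
     .save("/tmp/hudi/demo")
   ```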
   
   

