RajasekarSribalan opened a new issue #2214:
URL: https://github.com/apache/hudi/issues/2214


   We have a Hudi Spark pipeline that constantly upserts into a Hudi table. 
Incoming traffic on the table is 5k records per second. We use the COW table 
type, but after upserts we see many duplicate rows for the same record key. 
We set the precombine field, which is a date string field. An upsert should 
always update the existing record, but instead it creates a duplicate entry. 
Please note that the incoming messages may contain duplicate records, so the 
DataFrame being written can contain duplicates as well.
    Also, we query the table from Spark SQL, with the properties/config set 
according to the Hudi docs.
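
To make the expected behaviour concrete, here is a minimal plain-Python sketch of what we understand precombining to mean: for each record key, only the row with the greatest precombine value should survive. This is an illustration only, not Hudi's actual code. The field names `id` and `updated_at` and the helper `precombine` are hypothetical. It also assumes the date string is compared as a plain string, so it only orders correctly in a sortable format such as `YYYY-MM-DD`.

```python
from typing import Any, Dict, Iterable, List


def precombine(
    records: Iterable[Dict[str, Any]],
    key_field: str,
    precombine_field: str,
) -> List[Dict[str, Any]]:
    """Keep one record per key: the one with the greatest precombine value.

    Illustrative only; this is not Hudi's implementation.
    """
    latest: Dict[Any, Dict[str, Any]] = {}
    for rec in records:
        key = rec[key_field]
        current = latest.get(key)
        # On ties, the later record in the batch wins.
        if current is None or rec[precombine_field] >= current[precombine_field]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical incoming batch with duplicate keys, as in our pipeline.
batch = [
    {"id": "a", "updated_at": "2020-10-01", "value": 1},
    {"id": "a", "updated_at": "2020-10-03", "value": 2},
    {"id": "b", "updated_at": "2020-10-02", "value": 3},
    {"id": "a", "updated_at": "2020-10-02", "value": 4},
]

deduped = precombine(batch, "id", "updated_at")
```

With this batch we would expect exactly one row for key `a` (the one dated `2020-10-03`) and one row for key `b`, both within the batch and in the table after the upsert.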
   
   **Version details**
   
   Table type : COW
   Operation : Upsert
   Hudi - 0.5.2-incubating
   Spark - 2.2.0
   
   @vinothchandar @bvaradar @bhasudha Please assist! We thought of running 
repair deduplicate from the Hudi CLI, but it seems to support only partitioned 
tables, and our table is non-partitioned.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

