RajasekarSribalan opened a new issue #2214:
URL: https://github.com/apache/hudi/issues/2214
We have a Hudi Spark pipeline that constantly upserts into a Hudi table.
Incoming traffic on the table is 5k records per second. We use the COW table
type, but after an upsert we see many duplicate rows for the same record key.
We do set the precombine field, which is a date string field. An upsert should
always update the existing record, but instead it creates a duplicate entry.
Please note that the incoming messages may contain duplicates, so the DataFrame
itself can contain duplicate records.
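To make the expected behaviour concrete: when several incoming rows share a record key, we expect only the row with the largest precombine value to survive. The sketch below is plain Python, not Hudi or Spark code. The field names `id` and `updated_at` and the helper `dedupe_latest` are hypothetical and used only for illustration. It also assumes the date string is compared as a plain string, which only gives chronological order for formats such as `yyyy-MM-dd HH:mm:ss`.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def dedupe_latest(records: Iterable[Record], key_field: str,
                  precombine_field: str) -> List[Record]:
    """Keep one record per key: the one with the largest precombine value."""
    latest: Dict[Any, Record] = {}
    for rec in records:
        key = rec[key_field]
        current = latest.get(key)
        # Plain string comparison (assumption): this matches chronological
        # order only if the date format sorts lexicographically.
        if current is None or rec[precombine_field] >= current[precombine_field]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical batch with a duplicate record key "a".
batch = [
    {"id": "a", "updated_at": "2020-10-27 09:00:00", "value": 1},
    {"id": "a", "updated_at": "2020-10-27 10:30:00", "value": 2},
    {"id": "b", "updated_at": "2020-10-26 23:59:59", "value": 3},
]
deduped = dedupe_latest(batch, "id", "updated_at")

# A day-first format does not sort chronologically as a string, so the
# older row (27 October) wins over the newer one (1 November).
day_first = [
    {"id": "a", "updated_at": "27-10-2020", "value": 1},
    {"id": "a", "updated_at": "01-11-2020", "value": 2},
]
missorted = dedupe_latest(day_first, "id", "updated_at")
```

If the precombine values in a table look like the second batch, string ordering alone could keep stale rows, although that would not by itself explain two rows with the same key after an upsert.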
Also, we query from Spark SQL and set the properties/config according to the
Hudi docs.
**Version details**
Table type : COW
Operation : Upsert
Hudi - 0.5.2-incubating
Spark - 2.2.0
@vinothchandar @bvaradar @bhasudha Please assist! We thought of running
repair deduplicate from the Hudi CLI, but it seems to support only partitioned
tables, and our table is non-partitioned.