xushiyan commented on issue #5857:
URL: https://github.com/apache/hudi/issues/5857#issuecomment-1296323395

   > More clues for data duplication issue: I noticed two exactly the same records, one in avro log file, the other in merged parquet file after spark insert.
   
   Spark and Flink writers diverge when writing to MOR (merge-on-read) tables:
Spark writes updates to log files and inserts (new records) directly to base
files, while Flink writes all updates and inserts to log files. So in your
pipeline setup, it is possible that the Flink writer puts all new data in log
files and, before compaction kicks in, Spark writes some overlapping data to
base files. That would leave the same record in both an Avro log file and a
Parquet base file, which matches what you observed.
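
   To make the overlap concrete, here is a toy model of the two routing behaviors described above. It is only a sketch and does not use any Hudi API: `ToyMorFileGroup`, `flink_style_write`, `spark_style_write`, and `keys_in_both` are hypothetical names, and it assumes the Spark-side writer treats a record as an update only when its key is already present in a base file.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMorFileGroup:
    """Toy stand-in for one MOR file group (not a Hudi class)."""
    base: dict = field(default_factory=dict)  # key -> value, models the Parquet base file
    log: list = field(default_factory=list)   # (key, value) entries, models the Avro log file


def flink_style_write(fg, records):
    # Per the comment: Flink sends both inserts and updates to log files.
    for key, value in records.items():
        fg.log.append((key, value))


def spark_style_write(fg, records):
    # Per the comment: Spark sends updates to log files and inserts to base files.
    # ASSUMPTION: a record counts as an update only if its key is already in a base file.
    for key, value in records.items():
        if key in fg.base:
            fg.log.append((key, value))
        else:
            fg.base[key] = value


def keys_in_both(fg):
    # Keys present in the base file and in a not-yet-compacted log entry.
    return sorted(set(fg.base) & {key for key, _ in fg.log})


fg = ToyMorFileGroup()
flink_style_write(fg, {"id-1": "a", "id-2": "b"})  # new data lands in the log
spark_style_write(fg, {"id-2": "b", "id-3": "c"})  # "id-2" looks new, goes to base
overlap = keys_in_both(fg)
print(overlap)  # ['id-2']
```

   In this model `id-2` sits in the log (from the Flink-style write) and in the base file (from the Spark-style write) until something merges them, which is the same shape as the duplicate you reported.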


