xushiyan commented on issue #5857: URL: https://github.com/apache/hudi/issues/5857#issuecomment-1296323395
> More clues for data duplication issue: I noticed two exactly the same records, one in avro log file, the other in merged parquet file after spark insert.

The Spark and Flink writers diverge when writing to MOR (merge-on-read) tables: Spark writes updates to log files and inserts (new records) directly to base files, while Flink writes all updates and inserts to log files. So in your pipeline setup, it is possible that the Flink writer writes all new data to log files and, before compaction kicks in, the Spark writer writes some overlapping data to base files.
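To make the sequence above concrete, here is a minimal toy model in Python. It is only a sketch and not Hudi code: `ToyMorTable`, `flink_style_write`, `spark_style_write`, and `compact` are hypothetical names, and the rule that the Spark-style writer decides whether a record is "new" by looking at base files only is an assumption of this sketch, not something stated in the comment.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMorTable:
    """Toy stand-in for a MOR table: one base store plus one log store."""
    base: dict = field(default_factory=dict)  # record key -> value (parquet base files)
    log: dict = field(default_factory=dict)   # record key -> value (avro log files)


def flink_style_write(table, records):
    """Flink-style routing: inserts and updates alike go to log files."""
    table.log.update(records)


def spark_style_write(table, records):
    """Spark-style routing: updates go to log files, new records go to base files.

    ASSUMPTION of this sketch: a record counts as "new" when its key is
    absent from the base files; uncompacted log files are not consulted.
    Returns the keys that were written as inserts.
    """
    inserted = []
    for key, value in records.items():
        if key in table.base:
            table.log[key] = value   # known key: treated as an update
        else:
            table.base[key] = value  # unknown key: treated as an insert
            inserted.append(key)
    return inserted


def compact(table):
    """Merge log records into the base files and clear the logs."""
    table.base.update(table.log)
    table.log.clear()


def demo():
    """Flink writes id-1, then Spark writes overlapping data before compaction."""
    table = ToyMorTable()
    flink_style_write(table, {"id-1": "a"})
    inserted = spark_style_write(table, {"id-1": "a", "id-2": "b"})
    return table, inserted
```

In `demo()`, the key `id-1` ends up in both the log store and the base store, which mirrors the identical records reported in the quoted message. Calling `compact(table)` between the two writes makes the Spark-style writer treat `id-1` as an update instead, so in this toy model the duplicate insert does not occur.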
