yihua commented on issue #9615: URL: https://github.com/apache/hudi/issues/9615#issuecomment-1708838256
I've encountered duplicate parquet files for regular commits as well when Spark's speculative execution is turned on. This causes a mismatch between the commit metadata / MDT file listing and the data files on the file system. The root cause is the same: with speculative execution enabled, Spark retries and kills tasks, and some duplicate data files left behind by those tasks are not removed by the marker-based delete stage.

@beyond1920 @KnightChess could you share screenshots of the jobs in the Spark UI when the issue happens? I'm trying to understand why marker-based deletion of duplicate parquet files does not work.
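For context, here is a minimal conceptual sketch of the marker-based reconciliation being discussed (this is an illustration, not Hudi's actual implementation; `reconcile_markers` and `delete_fn` are hypothetical names). The idea is that each write task records a marker for every data file it creates; at commit finalization, any file that has a marker but does not appear in the commit metadata — for example, the output of a killed speculative task attempt — should be deleted. If that delete step fails or is skipped, orphaned duplicates remain on the file system:

```python
# Conceptual sketch (not Hudi's actual code) of marker-based reconciliation.
# Markers record every data file created by write tasks; files with markers
# that are absent from the committed file list are treated as duplicates
# (e.g. output of killed speculative task attempts) and deleted.

def reconcile_markers(marker_paths, committed_paths, delete_fn):
    """Delete data files that were created (have markers) but never committed."""
    duplicates = set(marker_paths) - set(committed_paths)
    for path in sorted(duplicates):
        delete_fn(path)  # if this delete fails or is skipped, duplicates remain
    return duplicates

# Example: a speculative attempt left an extra file behind.
markers = {"p1/file-a.parquet", "p1/file-a_attempt2.parquet"}
committed = {"p1/file-a.parquet"}
removed = []
reconcile_markers(markers, committed, removed.append)
print(removed)  # only the uncommitted speculative file is deleted
```

Under speculative execution, multiple attempts of the same task can race to write files, so the failure mode in this issue would correspond to the delete stage above not covering all leftover attempt files.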
