yihua commented on issue #9615:
URL: https://github.com/apache/hudi/issues/9615#issuecomment-1708838256

   I've encountered duplicate parquet files for regular commits as well when 
Spark's speculative execution is turned on.  This causes a mismatch between 
the commit metadata / MDT file listing and the data files on the file system. 
The root cause is the same: with speculative execution on, Spark retries and 
kills tasks in the marker-based delete stage, so some duplicate data files 
are never deleted from the file system.
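   For anyone hitting this in the meantime, a possible stopgap (not a fix for 
the marker-based cleanup itself, which is what's being investigated here) is to 
turn off speculative execution for the Hudi write job, since the duplicates 
only appear when it is enabled. `spark.speculation` is the standard Spark flag; 
the job script name below is just a placeholder. A minimal sketch:

   ```shell
   # Stopgap only: with speculation off, Spark launches no duplicate task
   # attempts, so no extra parquet files are produced in the first place.
   spark-submit \
     --conf spark.speculation=false \
     your_hudi_write_job.py   # placeholder for your actual job
   ```

   The trade-off is losing speculation's protection against straggler tasks, 
so this is only reasonable until the marker-based deletion issue is understood.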
   
   @beyond1920 @KnightChess could you share screenshots of the relevant jobs 
in the Spark UI from when the issue happens?  I'm trying to understand why the 
marker-based deletion of duplicate parquet files does not work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
