beyond1920 opened a new issue, #11419: URL: https://github.com/apache/hudi/issues/11419
Dear community,

One of our users reported that after their daily job, which writes to a Hudi COW (copy-on-write) table, finished today, the downstream reading jobs found many duplicate records. The daily job has been running in production for a long time, and this is the first time it has produced such a wrong result.

The user provided one duplicated record as an example to help debug. The record appears in 3 base files that belong to different file groups.

<img width="491" alt="image" src="https://github.com/apache/hudi/assets/1525333/60b95dc4-91d6-4b40-8bca-c877a4407ae0">

I checked today's writer job, and the Spark application finished successfully. In the driver log, two of those files are marked as invalid files to be deleted, and only one file is valid.

<img width="1380" alt="image" src="https://github.com/apache/hudi/assets/1525333/8e19e170-e38f-4725-82a5-84ed55750db9">

In the clean stage task log, those two files are also marked for deletion, and there is no exception in the task either.

<img width="1099" alt="image" src="https://github.com/apache/hudi/assets/1525333/1a819bd0-2dbe-4236-a0ed-e5f4576cfa38">

Those two files already existed on HDFS before the clean stage began, and they still existed after the clean stage finished.

Finally, I found that the root cause is a corner case in HDFS: `fs.delete` does not throw an exception when HDFS fails to delete the file; it only returns `false`.

<img width="1296" alt="image" src="https://github.com/apache/hudi/assets/1525333/4a1f46d8-0b6b-4089-bed1-7d6a2e72ac28">

I checked the `fs.delete` API, and this behavior is reasonable.

<img width="890" alt="image" src="https://github.com/apache/hudi/assets/1525333/20b7e237-18d4-480a-aedc-6c5a57b24062">

I think we should check the return value of `fs.delete` in `HoodieTable#deleteInvalidFilesByPartitions` to avoid wrong results. Besides, it is necessary to review all other places that call `fs.delete`. Any suggestions?
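To make the proposal concrete, here is a minimal sketch of what checking the return value could look like. It is not Hudi's actual code: `CheckedDelete`, `SimpleFs`, `deleteOrThrow`, and `deleteAllOrThrow` are hypothetical names, and `SimpleFs` is only a stand-in for the two `FileSystem` calls involved, so that the example compiles without Hadoop on the classpath. The `exists` re-check is an assumption on my side: a `false` return may also mean the path was already gone, which is harmless here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class CheckedDelete {

    /** Hypothetical stand-in for the delete/exists calls of a file system client. */
    interface SimpleFs {
        /** Returns false, without throwing, when the path was not deleted. */
        boolean delete(String path, boolean recursive) throws IOException;

        boolean exists(String path) throws IOException;
    }

    private CheckedDelete() {
    }

    /** Deletes one file and fails loudly if it is still present afterwards. */
    static void deleteOrThrow(SimpleFs fs, String path) throws IOException {
        boolean deleted = fs.delete(path, false);
        // A false return is only a problem if the file is actually still there.
        if (!deleted && fs.exists(path)) {
            throw new IOException("Failed to delete invalid file: " + path);
        }
    }

    /** Tries every path, then reports all files that survived in one exception. */
    static void deleteAllOrThrow(SimpleFs fs, List<String> paths) throws IOException {
        List<String> failed = new ArrayList<>();
        for (String path : paths) {
            boolean deleted = fs.delete(path, false);
            if (!deleted && fs.exists(path)) {
                failed.add(path);
            }
        }
        if (!failed.isEmpty()) {
            throw new IOException("Failed to delete invalid files: " + String.join(", ", failed));
        }
    }
}
```

With a check like this, the writer job in the case above would have failed instead of committing while the two invalid base files were still visible to readers. Whether to fail the commit, retry, or only log is a separate design decision.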
