Jing Zhang created HUDI-7867:
--------------------------------

             Summary: Data deduplication caused by drawback in the delete 
invalid files before commit
                 Key: HUDI-7867
                 URL: https://issues.apache.org/jira/browse/HUDI-7867
             Project: Apache Hudi
          Issue Type: Bug
          Components: core
            Reporter: Jing Zhang


Our user complained that after their daily run job which written to a Hudi cow 
table finished, the downstream reading jobs find many duplicate records today. 
The daily run job has been already online for a long time, and this is the 
first time of such wrong result.
He gives a detailed deduplicated record as example to help debug. The record 
appeared in 3 base files which belongs to different file groups.
[!https://private-user-images.githubusercontent.com/1525333/337907952-60b95dc4-91d6-4b40-8bca-c877a4407ae0.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkwNzk1Mi02MGI5NWRjNC05MWQ2LTRiNDAtOGJjYS1jODc3YTQ0MDdhZTAucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTZhMThjZDdiNjNmYjYyZmU5Mjg3OWIyMTg5ZTFkNDBmMTc5NjliZjFjMjQwZWQwM2JjZjMxNDU4ZDA3NzZhZSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.ueqsTezXNbtnxyqSyzW2_v92Jc0z_7ioljutPcfcWwE|width=491!|https://private-user-images.githubusercontent.com/1525333/337907952-60b95dc4-91d6-4b40-8bca-c877a4407ae0.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkwNzk1Mi02MGI5NWRjNC05MWQ2LTRiNDAtOGJjYS1jODc3YTQ0MDdhZTAucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTZhMThjZDdiNjNmYjYyZmU5Mjg3OWIyMTg5ZTFkNDBmMTc5NjliZjFjMjQwZWQwM2JjZjMxNDU4ZDA3NzZhZSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.ueqsTezXNbtnxyqSyzW2_v92Jc0z_7ioljutPcfcWwE]
I find the today's writer job, the spark application finished successfully.
In the driver log, I find those two files marked as invalid files which to 
delete, only one file is valid files.
[!https://private-user-images.githubusercontent.com/1525333/337909363-8e19e170-e38f-4725-82a5-84ed55750db9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkwOTM2My04ZTE5ZTE3MC1lMzhmLTQ3MjUtODJhNS04NGVkNTU3NTBkYjkucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NzUwMGQ4ODU2NDNmODFiYmE2YjA0OGIzMzBhZGU4OGMxOGYxMTNkZTJjNzZjZDI0N2YwNDRmMWMwY2ZiNWQzOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.0RruG5y4012v6dHdoqmEEMTT2oLWjmIHQsa_JHl-vmg|width=1380!|https://private-user-images.githubusercontent.com/1525333/337909363-8e19e170-e38f-4725-82a5-84ed55750db9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkwOTM2My04ZTE5ZTE3MC1lMzhmLTQ3MjUtODJhNS04NGVkNTU3NTBkYjkucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NzUwMGQ4ODU2NDNmODFiYmE2YjA0OGIzMzBhZGU4OGMxOGYxMTNkZTJjNzZjZDI0N2YwNDRmMWMwY2ZiNWQzOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.0RruG5y4012v6dHdoqmEEMTT2oLWjmIHQsa_JHl-vmg]
And in the clean stage task log, those two files are also marked to be deleted 
and there is no exception in the task either.
[!https://private-user-images.githubusercontent.com/1525333/337911404-1a819bd0-2dbe-4236-a0ed-e5f4576cfa38.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxMTQwNC0xYTgxOWJkMC0yZGJlLTQyMzYtYTBlZC1lNWY0NTc2Y2ZhMzgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDI2ZGZkZTBkMTE1MmE1NDAzMWE3MzAzOGUwMWVmMjA0NjZmMDMyZjhhYTlmMmJlOWFiOTI3NzJlMWMzMmExNiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.7l4mIJmEAJ5m0Ly9cM0z-lcMaHYEVqcfDKZ8piJZTwg|width=1099!|https://private-user-images.githubusercontent.com/1525333/337911404-1a819bd0-2dbe-4236-a0ed-e5f4576cfa38.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxMTQwNC0xYTgxOWJkMC0yZGJlLTQyMzYtYTBlZC1lNWY0NTc2Y2ZhMzgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDI2ZGZkZTBkMTE1MmE1NDAzMWE3MzAzOGUwMWVmMjA0NjZmMDMyZjhhYTlmMmJlOWFiOTI3NzJlMWMzMmExNiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.7l4mIJmEAJ5m0Ly9cM0z-lcMaHYEVqcfDKZ8piJZTwg]
Those two files already existed on the hdfs before the clean stage began, but 
they still existed after the clean stage.

Finally, found the root cause is some corner case happened in hdfs. And 
{{fs.delete}} does not throw any exception, only return {{false}} if the hdfs 
does not delete the file successfully.
[!https://private-user-images.githubusercontent.com/1525333/337913364-4a1f46d8-0b6b-4089-bed1-7d6a2e72ac28.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxMzM2NC00YTFmNDZkOC0wYjZiLTQwODktYmVkMS03ZDZhMmU3MmFjMjgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MWZmMDY0Y2E0NDkwODMwMjdkMzhhNjczNWE5MDY0MjFmMDllMWUzNmUxMTIzM2NiMmJhNDEyMjk0ZTA0YjM1NSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.wZsOPdkglkwGAGnirsQPfRNO6YL31IQOI-hefEpyP4w|width=1296!|https://private-user-images.githubusercontent.com/1525333/337913364-4a1f46d8-0b6b-4089-bed1-7d6a2e72ac28.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxMzM2NC00YTFmNDZkOC0wYjZiLTQwODktYmVkMS03ZDZhMmU3MmFjMjgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MWZmMDY0Y2E0NDkwODMwMjdkMzhhNjczNWE5MDY0MjFmMDllMWUzNmUxMTIzM2NiMmJhNDEyMjk0ZTA0YjM1NSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.wZsOPdkglkwGAGnirsQPfRNO6YL31IQOI-hefEpyP4w]
And I check the {{fs.delete}} api, the definition is reasonable.
[!https://private-user-images.githubusercontent.com/1525333/337914721-20b7e237-18d4-480a-aedc-6c5a57b24062.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxNDcyMS0yMGI3ZTIzNy0xOGQ0LTQ4MGEtYWVkYy02YzVhNTdiMjQwNjIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Mzk3ZWFiZDA5YjIyYmJlZmJhZTFhOWU5MDRmNmM4MjA0Y2E5YTc2ODZmY2JhNDJlMjkyZTE3ODk0MThmNmYxMiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.lc6SVIoKOga5F0LMrHY45A3GjsrCE1LecTSg5ruwHzE|width=890!|https://private-user-images.githubusercontent.com/1525333/337914721-20b7e237-18d4-480a-aedc-6c5a57b24062.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxNDcyMS0yMGI3ZTIzNy0xOGQ0LTQ4MGEtYWVkYy02YzVhNTdiMjQwNjIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Mzk3ZWFiZDA5YjIyYmJlZmJhZTFhOWU5MDRmNmM4MjA0Y2E5YTc2ODZmY2JhNDJlMjkyZTE3ODk0MThmNmYxMiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.lc6SVIoKOga5F0LMrHY45A3GjsrCE1LecTSg5ruwHzE]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to