Jing Zhang created HUDI-7867:
--------------------------------
Summary: Data duplication caused by a flaw in deleting
invalid files before commit
Key: HUDI-7867
URL: https://issues.apache.org/jira/browse/HUDI-7867
Project: Apache Hudi
Issue Type: Bug
Components: core
Reporter: Jing Zhang
Our user complained that after their daily job, which writes to a Hudi COW
table, finished, the downstream reading jobs found many duplicate records today.
The daily job has been online for a long time, and this is the
first time it has produced such a wrong result.
He gave a detailed duplicated record as an example to help debug. The record
appeared in 3 base files which belong to different file groups.
[!https://private-user-images.githubusercontent.com/1525333/337907952-60b95dc4-91d6-4b40-8bca-c877a4407ae0.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkwNzk1Mi02MGI5NWRjNC05MWQ2LTRiNDAtOGJjYS1jODc3YTQ0MDdhZTAucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTZhMThjZDdiNjNmYjYyZmU5Mjg3OWIyMTg5ZTFkNDBmMTc5NjliZjFjMjQwZWQwM2JjZjMxNDU4ZDA3NzZhZSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.ueqsTezXNbtnxyqSyzW2_v92Jc0z_7ioljutPcfcWwE|width=491!|https://private-user-images.githubusercontent.com/1525333/337907952-60b95dc4-91d6-4b40-8bca-c877a4407ae0.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkwNzk1Mi02MGI5NWRjNC05MWQ2LTRiNDAtOGJjYS1jODc3YTQ0MDdhZTAucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTZhMThjZDdiNjNmYjYyZmU5Mjg3OWIyMTg5ZTFkNDBmMTc5NjliZjFjMjQwZWQwM2JjZjMxNDU4ZDA3NzZhZSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.ueqsTezXNbtnxyqSyzW2_v92Jc0z_7ioljutPcfcWwE]
I found today's writer job; the Spark application finished successfully.
In the driver log, I found that two of those files were marked as invalid files
to delete, and only one file was valid.
[!https://private-user-images.githubusercontent.com/1525333/337909363-8e19e170-e38f-4725-82a5-84ed55750db9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkwOTM2My04ZTE5ZTE3MC1lMzhmLTQ3MjUtODJhNS04NGVkNTU3NTBkYjkucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NzUwMGQ4ODU2NDNmODFiYmE2YjA0OGIzMzBhZGU4OGMxOGYxMTNkZTJjNzZjZDI0N2YwNDRmMWMwY2ZiNWQzOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.0RruG5y4012v6dHdoqmEEMTT2oLWjmIHQsa_JHl-vmg|width=1380!|https://private-user-images.githubusercontent.com/1525333/337909363-8e19e170-e38f-4725-82a5-84ed55750db9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkwOTM2My04ZTE5ZTE3MC1lMzhmLTQ3MjUtODJhNS04NGVkNTU3NTBkYjkucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NzUwMGQ4ODU2NDNmODFiYmE2YjA0OGIzMzBhZGU4OGMxOGYxMTNkZTJjNzZjZDI0N2YwNDRmMWMwY2ZiNWQzOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.0RruG5y4012v6dHdoqmEEMTT2oLWjmIHQsa_JHl-vmg]
And in the clean stage task log, those two files are also marked to be deleted
and there is no exception in the task either.
[!https://private-user-images.githubusercontent.com/1525333/337911404-1a819bd0-2dbe-4236-a0ed-e5f4576cfa38.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxMTQwNC0xYTgxOWJkMC0yZGJlLTQyMzYtYTBlZC1lNWY0NTc2Y2ZhMzgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDI2ZGZkZTBkMTE1MmE1NDAzMWE3MzAzOGUwMWVmMjA0NjZmMDMyZjhhYTlmMmJlOWFiOTI3NzJlMWMzMmExNiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.7l4mIJmEAJ5m0Ly9cM0z-lcMaHYEVqcfDKZ8piJZTwg|width=1099!|https://private-user-images.githubusercontent.com/1525333/337911404-1a819bd0-2dbe-4236-a0ed-e5f4576cfa38.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxMTQwNC0xYTgxOWJkMC0yZGJlLTQyMzYtYTBlZC1lNWY0NTc2Y2ZhMzgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDI2ZGZkZTBkMTE1MmE1NDAzMWE3MzAzOGUwMWVmMjA0NjZmMDMyZjhhYTlmMmJlOWFiOTI3NzJlMWMzMmExNiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.7l4mIJmEAJ5m0Ly9cM0z-lcMaHYEVqcfDKZ8piJZTwg]
Those two files already existed on HDFS before the clean stage began, and
they still existed after the clean stage.
Finally, I found that the root cause is a corner case in HDFS:
{{fs.delete}} does not throw any exception and only returns {{false}} if HDFS
does not delete the file successfully.
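To illustrate the failure mode, here is a minimal sketch of a delete that treats a {{false}} return as an error when the file is still present. It is not the Hudi code path or a proposed patch. {{SimpleFs}}, {{StrictDelete}} and {{deleteOrFail}} are hypothetical names; the interface only stands in for Hadoop's {{FileSystem.delete(Path, boolean)}} and {{FileSystem.exists(Path)}} so that the example runs without a Hadoop dependency.

```java
import java.io.IOException;

/**
 * Hypothetical stand-in for the two file system calls involved.
 * In Hadoop these would be FileSystem.delete(Path, boolean) and
 * FileSystem.exists(Path); plain strings are used here as an assumption
 * to keep the sketch self-contained.
 */
interface SimpleFs {
    /** Returns true if the path was deleted, false otherwise (no exception). */
    boolean delete(String path) throws IOException;

    /** Returns true if the path currently exists. */
    boolean exists(String path) throws IOException;
}

final class StrictDelete {
    private StrictDelete() {}

    /**
     * Deletes the path and fails loudly if it is still there afterwards.
     * A false return alone is not treated as an error, because it can also
     * mean the file was already gone; the follow-up exists() check
     * distinguishes the two cases.
     */
    static void deleteOrFail(SimpleFs fs, String path) throws IOException {
        boolean deleted = fs.delete(path);
        if (!deleted && fs.exists(path)) {
            throw new IOException("Failed to delete invalid file: " + path);
        }
    }
}
```

With a check of this kind, a silently failed delete would fail the task instead of letting the commit proceed with the invalid files still in place. Whether to fail, retry, or only log is a design decision left open here.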
[!https://private-user-images.githubusercontent.com/1525333/337913364-4a1f46d8-0b6b-4089-bed1-7d6a2e72ac28.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxMzM2NC00YTFmNDZkOC0wYjZiLTQwODktYmVkMS03ZDZhMmU3MmFjMjgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MWZmMDY0Y2E0NDkwODMwMjdkMzhhNjczNWE5MDY0MjFmMDllMWUzNmUxMTIzM2NiMmJhNDEyMjk0ZTA0YjM1NSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.wZsOPdkglkwGAGnirsQPfRNO6YL31IQOI-hefEpyP4w|width=1296!|https://private-user-images.githubusercontent.com/1525333/337913364-4a1f46d8-0b6b-4089-bed1-7d6a2e72ac28.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxMzM2NC00YTFmNDZkOC0wYjZiLTQwODktYmVkMS03ZDZhMmU3MmFjMjgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MWZmMDY0Y2E0NDkwODMwMjdkMzhhNjczNWE5MDY0MjFmMDllMWUzNmUxMTIzM2NiMmJhNDEyMjk0ZTA0YjM1NSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.wZsOPdkglkwGAGnirsQPfRNO6YL31IQOI-hefEpyP4w]
I also checked the {{fs.delete}} API, and its definition is reasonable.
[!https://private-user-images.githubusercontent.com/1525333/337914721-20b7e237-18d4-480a-aedc-6c5a57b24062.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxNDcyMS0yMGI3ZTIzNy0xOGQ0LTQ4MGEtYWVkYy02YzVhNTdiMjQwNjIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Mzk3ZWFiZDA5YjIyYmJlZmJhZTFhOWU5MDRmNmM4MjA0Y2E5YTc2ODZmY2JhNDJlMjkyZTE3ODk0MThmNmYxMiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.lc6SVIoKOga5F0LMrHY45A3GjsrCE1LecTSg5ruwHzE|width=890!|https://private-user-images.githubusercontent.com/1525333/337914721-20b7e237-18d4-480a-aedc-6c5a57b24062.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgxOTk5ODEsIm5iZiI6MTcxODE5OTY4MSwicGF0aCI6Ii8xNTI1MzMzLzMzNzkxNDcyMS0yMGI3ZTIzNy0xOGQ0LTQ4MGEtYWVkYy02YzVhNTdiMjQwNjIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MTJUMTM0MTIxWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Mzk3ZWFiZDA5YjIyYmJlZmJhZTFhOWU5MDRmNmM4MjA0Y2E5YTc2ODZmY2JhNDJlMjkyZTE3ODk0MThmNmYxMiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.lc6SVIoKOga5F0LMrHY45A3GjsrCE1LecTSg5ruwHzE]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)