Prashant Wason created HUDI-2925:
------------------------------------
Summary: Cleaner may attempt to delete the same file twice when
metadata table is enabled
Key: HUDI-2925
URL: https://issues.apache.org/jira/browse/HUDI-2925
Project: Apache Hudi
Issue Type: Bug
Reporter: Prashant Wason
Assignee: Prashant Wason
Fix For: 0.10.0
This issue happens only when TimelineServer is disabled (reason in next
comment). Our pipelines execute a write (insert or upsert) along with an
asynchronous clean. Metadata table is enabled.
Assume the timelines are as follows:
Dataset: 100.commit 101.commit 102.clean.inflight
Metadata: 100.deltacommit
(this happened because the pipeline failed due to non-HUDI issues while
executing 101 and 102)
In the next run of the pipeline some more data is available so a commit will
take place (103.commit.requested). Along with it, an asynchronous clean starts
(104.clean.requested). The [BaseCleanActionExecutor detects the previously
unfinished
clean|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java#L231]
(102.clean.inflight) and attempts to complete it first. So the order of cleans
will be 102.clean followed by 104.clean.
102.clean => Suppose this deletes files from 90.commit
104.clean => This should delete files from 91.commit
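The ordering above can be illustrated with a minimal sketch. It is not Hudi code: the class CleanOrderSketch and the method executionOrder are hypothetical, and treating both ".clean.requested" and ".clean.inflight" instants as pending is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not a Hudi class.
class CleanOrderSketch {
    /**
     * Returns the order in which cleans would run: every pending clean found
     * on the timeline first, then the newly requested clean.
     */
    static List<String> executionOrder(List<String> timeline, String newClean) {
        List<String> order = new ArrayList<>();
        for (String instant : timeline) {
            // Assumption: a clean is pending if it is requested or inflight.
            boolean pending = instant.endsWith(".clean.inflight")
                    || instant.endsWith(".clean.requested");
            if (pending) {
                // Keep only the instant time, e.g. "102" from "102.clean.inflight".
                String time = instant.substring(0, instant.indexOf('.'));
                order.add(time + ".clean");
            }
        }
        order.add(newClean);
        return order;
    }
}
```

With the dataset timeline from this report and 104.clean as the new clean, the sketch places 102.clean ahead of 104.clean.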
The issue is that while executing 104.clean, the file system view is still the
one which was used during 102.clean (i.e. the file system view is not synced
after the clean). When metadata table is enabled, HoodieMetadataFileSystemView
is used, which has the metadata reader inside it. This metadata reader opens
the metadata table at a particular time instant (here 101.commit, as that was
the last completed action). Even after 102.clean is completed, the
HoodieMetadataFileSystemView is still using the cached metadata reader. Hence,
the reader still returns files from 90.commit which have already been deleted
by 102.clean, and 104.clean may attempt to delete them a second time.
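The stale-reader behaviour can be reproduced with a toy model. Everything in the sketch below is hypothetical (MetadataTable, CachedView, sync, the file names and the instant numbers are stand-ins, not Hudi APIs); it only shows how a view pinned to an older instant keeps listing a file that a later clean has already removed, and how re-syncing the view avoids that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration; none of these are Hudi classes.
class StaleViewSketch {
    /** Toy metadata table: instant time mapped to the files listed as of that instant. */
    static class MetadataTable {
        private final TreeMap<Integer, List<String>> snapshots = new TreeMap<>();

        void commit(int instant, List<String> files) {
            snapshots.put(instant, new ArrayList<>(files));
        }

        int latestInstant() {
            return snapshots.lastKey();
        }

        List<String> filesAsOf(int instant) {
            // Latest snapshot at or before the given instant.
            return snapshots.floorEntry(instant).getValue();
        }
    }

    /** Toy view whose reader stays pinned to the instant at which it was opened. */
    static class CachedView {
        private final MetadataTable table;
        private int pinnedInstant;

        CachedView(MetadataTable table) {
            this.table = table;
            this.pinnedInstant = table.latestInstant();
        }

        List<String> listFiles() {
            return table.filesAsOf(pinnedInstant);
        }

        /** Re-pins the reader to the latest completed instant. */
        void sync() {
            pinnedInstant = table.latestInstant();
        }
    }

    /** Returns what the view lists after a first clean has removed "f90". */
    static List<String> filesSeenBySecondClean(boolean syncAfterFirstClean) {
        MetadataTable table = new MetadataTable();
        table.commit(101, Arrays.asList("f90", "f91", "f101"));
        CachedView view = new CachedView(table); // reader opened at instant 101

        // The first clean (102) deletes f90 and records that in the metadata table.
        table.commit(102, Arrays.asList("f91", "f101"));

        if (syncAfterFirstClean) {
            view.sync();
        }
        return view.listFiles();
    }
}
```

Without the sync, the second clean still sees f90 and would try to delete it again; with the sync, f90 no longer appears in the listing.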