[ 
https://issues.apache.org/jira/browse/HUDI-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452799#comment-17452799
 ] 

Prashant Wason commented on HUDI-2925:
--------------------------------------

This will lead to the following exception:

Caused by: org.apache.hudi.exception.HoodieMetadataException: Metadata record 
for partition 2021/12/01 is inconsistent: HoodieMetadataPayload 
{key=2021/12/01, type=2, 
creations=[3c8de6fb-72b1-4793-9210-c8b4abbe327e-0_0-13-18449_20211201224350.parquet,
 3c8de6fb-72b1-4793-9210-c8b4abbe327e-0_0-13-18451_20211201234421.parquet, 
3c8de6fb-72b1-4793-9210-c8b4abbe327e-0_0-13-18452_20211202001827.parquet, 
3c8de6fb-72b1-4793-9210-c8b4abbe327e-0_0-13-18452_20211202005031.parquet, 
3c8de6fb-72b1-4793-9210-c8b4abbe327e-0_1-13-18455_20211201232855.parquet], 
deletions=[3c8de6fb-72b1-4793-9210-c8b4abbe327e-0_1-13-18453_20211201214355.parquet],
 }

 

This is not an issue in 0.10 today because the exception is swallowed by 
default and only a warning is logged. It also causes no data consistency 
problems, but the metadata table will no longer match the contents of the 
file system. 
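
As an illustration of why a repeated delete can surface as a leftover 
"deletions" entry like the one in the exception above, here is a minimal 
sketch. It is not Hudi code: PartitionRecordSketch and its merge rule are 
hypothetical and assume that a delete cancels a known creation and is 
otherwise kept as a dangling deletion, which a reader could then treat as 
an inconsistency.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical simplified model of a per-partition metadata record.
// This is NOT Hudi's HoodieMetadataPayload; the merge rule is an assumption.
class PartitionRecordSketch {
    final Set<String> creations = new TreeSet<>();
    final Set<String> deletions = new TreeSet<>();

    /** Records newly written files for the partition. */
    void applyCreations(List<String> files) {
        for (String f : files) {
            deletions.remove(f);
            creations.add(f);
        }
    }

    /**
     * Assumed rule: a delete cancels a known creation. A delete for a file
     * that is not listed is kept as a dangling deletion.
     */
    void applyDeletions(List<String> files) {
        for (String f : files) {
            if (!creations.remove(f)) {
                deletions.add(f);
            }
        }
    }

    /** A record with leftover deletions does not describe a valid listing. */
    boolean isConsistent() {
        return deletions.isEmpty();
    }

    /** Creates two files, then reports the same file as cleaned `times` times. */
    static PartitionRecordSketch afterCleans(int times) {
        PartitionRecordSketch r = new PartitionRecordSketch();
        r.applyCreations(List.of("f1.parquet", "f2.parquet"));
        for (int i = 0; i < times; i++) {
            r.applyDeletions(List.of("f1.parquet"));
        }
        return r;
    }
}
```

Under this assumed rule, one clean leaves the record consistent, while a 
second clean that reports the same file leaves it in "deletions".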

> Cleaner may attempt to delete the same file twice when metadata table is 
> enabled
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-2925
>                 URL: https://issues.apache.org/jira/browse/HUDI-2925
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Blocker
>             Fix For: 0.10.0
>
>
> This issue happens only when TimelineServer is disabled (reason in next 
> comment). Our pipelines execute a write (insert or upsert) along with an 
> asynchronous clean. Metadata table is enabled.
>  
> Assume the timelines are as follows:
> Dataset:   100.commit        101.commit   102.clean.inflight
> Metadata: 100.deltacommit
> (this happened because the pipeline failed due to non-HUDI issues while 
> executing 101 and 102)
>  
> In the next run of the pipeline some more data is available, so a commit will 
> take place (103.commit.requested). Along with it, an asynchronous clean 
> starts (104.clean.requested). The [BaseCleanActionExecutor detects the 
> previously unfinished 
> clean|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java#L231]
>  (102.clean.inflight) and attempts to complete it first. So the order of 
> cleans will be 102.clean followed by 104.clean.
>  
> 102.clean => Suppose this deletes files from 90.commit
> 104.clean  => This should delete files from 91.commit
>  
> The issue is that while executing 104.clean, the file system view is still 
> the one that was used during 102.clean (i.e. the file system view is not 
> synced after the clean). When metadata table is enabled, 
> HoodieMetadataFileSystemView is used, which has the metadata reader inside 
> it. This metadata reader opens the metadata table at a particular time 
> instant (will be 101.commit as that was the last completed action). Even 
> after 102.clean is completed, the HoodieMetadataFileSystemView is still 
> using the cached metadata reader. Hence, the reader still returns files from 
> 90.commit which have already been deleted by 102.clean.  
>  
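
The stale-view behaviour described in the issue can be sketched as follows. 
This is a simplified model under stated assumptions, not Hudi code: Storage, 
CachedView, sync() and clean() are hypothetical stand-ins for the file 
system, a view backed by a reader pinned to one instant, a view refresh, and 
a clean action.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-ins; none of these are Hudi classes.
class StaleViewSketch {

    /** Stand-in for the file system: the files that currently exist. */
    static final class Storage {
        final Set<String> files = new HashSet<>();
    }

    /** Stand-in for a view whose reader is pinned to one point in time. */
    static final class CachedView {
        private final Storage storage;
        private Set<String> snapshot;

        CachedView(Storage storage) {
            this.storage = storage;
            sync();
        }

        /** Re-reads the listing; the analogue of syncing the view after a clean. */
        void sync() {
            snapshot = new HashSet<>(storage.files);
        }

        /** Files of the given commit as seen by the cached snapshot. */
        List<String> filesOfCommit(String commit) {
            List<String> result = new ArrayList<>();
            for (String f : snapshot) {
                if (f.startsWith(commit + "/")) {
                    result.add(f);
                }
            }
            return result;
        }
    }

    /** Deletes the given files and returns those that were already gone. */
    static List<String> clean(Storage storage, List<String> toDelete) {
        List<String> alreadyDeleted = new ArrayList<>();
        for (String f : toDelete) {
            if (!storage.files.remove(f)) {
                alreadyDeleted.add(f);
            }
        }
        return alreadyDeleted;
    }

    /** Runs 102.clean then 104.clean; returns how many files 104 deleted twice. */
    static int runTwoCleans(boolean syncBetweenCleans) {
        Storage storage = new Storage();
        storage.files.add("90/a.parquet");
        storage.files.add("91/b.parquet");
        CachedView view = new CachedView(storage);

        // 102.clean removes the files written by 90.commit.
        clean(storage, view.filesOfCommit("90"));
        if (syncBetweenCleans) {
            view.sync();
        }

        // 104.clean plans from the view; a stale view still lists 90.commit.
        List<String> plan = new ArrayList<>(view.filesOfCommit("90"));
        plan.addAll(view.filesOfCommit("91"));
        return clean(storage, plan).size();
    }
}
```

In this model, runTwoCleans(false) makes the second clean target a file that 
the first clean already removed, while runTwoCleans(true) does not, which is 
why syncing the view between the two cleans avoids the double delete.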



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
