Nishith Agarwal created HUDI-1800:
-------------------------------------

             Summary: Incorrect HoodieTableFileSystem API usage for pending 
slices causing issues
                 Key: HUDI-1800
                 URL: https://issues.apache.org/jira/browse/HUDI-1800
             Project: Apache Hudi
          Issue Type: Bug
          Components: Writer Core
            Reporter: Nishith Agarwal


From [~vbalaji]:

 

We are using the wrong FileSystemView API here:

[https://github.com/apache/hudi/blob/release-0.6.0/hudi-client/src/main/java/org/apache/hudi/table/action/deltacommit/UpsertDeltaCommitPartitioner.java#L85]

We don't include file groups that are in pending compaction, but with the 
HBase Index we are including them. With the current state of the code, 
including files in pending compaction is an issue.

The API "getLatestFileSlicesBeforeOrOn" was originally intended to be used by 
CompactionAdminClient to figure out which log files were added after a pending 
compaction and rename them, so that we can undo the effects of compaction 
scheduling. There is a different API, "getLatestMergedFileSlicesBeforeOrOn", 
which gives a consolidated view of the latest file slice and includes all data 
both before and after compaction. This is what should be used in

[https://github.com/apache/hudi/blob/release-0.6.0/hudi-client/src/main/java/org/apache/hudi/table/action/deltacommit/UpsertDeltaCommitPartitioner.java#L85]
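
To make the difference between the two views concrete, here is a minimal 
sketch. It is a simplified model and not Hudi code: the class 
MergedSliceSketch, the Slice type, and the methods latestSlice and 
latestMergedSlice are hypothetical stand-ins, and the merge rule is an 
assumption based only on the description above (latest slice consolidated 
with the slice before the pending compaction).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model for illustration; these are NOT Hudi classes.
class MergedSliceSketch {

    // Stand-in for a file slice: an optional base file plus its log files.
    static final class Slice {
        final String baseInstant;
        final String baseFile; // null while the compaction that would write it is pending
        final List<String> logFiles;

        Slice(String baseInstant, String baseFile, List<String> logFiles) {
            this.baseInstant = baseInstant;
            this.baseFile = baseFile;
            this.logFiles = Collections.unmodifiableList(new ArrayList<>(logFiles));
        }
    }

    // "Latest slice only" view: with a pending compaction this slice holds
    // just the log files written after the compaction was scheduled.
    static Slice latestSlice(List<Slice> slicesOldestFirst) {
        return slicesOldestFirst.get(slicesOldestFirst.size() - 1);
    }

    // "Merged" view (assumed rule): if a compaction is pending, combine the
    // previous slice's base file and logs with the latest slice's logs.
    static Slice latestMergedSlice(List<Slice> slicesOldestFirst, boolean compactionPending) {
        Slice latest = latestSlice(slicesOldestFirst);
        if (!compactionPending || slicesOldestFirst.size() < 2) {
            return latest;
        }
        Slice previous = slicesOldestFirst.get(slicesOldestFirst.size() - 2);
        List<String> logs = new ArrayList<>(previous.logFiles);
        logs.addAll(latest.logFiles);
        return new Slice(previous.baseInstant, previous.baseFile, logs);
    }

    // One file group: a compaction was scheduled at t2 and has not completed.
    static List<Slice> exampleFileGroup() {
        return Arrays.asList(
            new Slice("t1", "base-t1.parquet", Arrays.asList("log-t1-1", "log-t1-2")),
            new Slice("t2", null, Arrays.asList("log-t2-1")));
    }

    public static void main(String[] args) {
        List<Slice> slices = exampleFileGroup();
        Slice latest = latestSlice(slices);
        Slice merged = latestMergedSlice(slices, true);
        System.out.println("latest: base=" + latest.baseFile + " logs=" + latest.logFiles);
        System.out.println("merged: base=" + merged.baseFile + " logs=" + merged.logFiles);
    }
}
```

In this model the latest-only view of a file group in pending compaction has 
no base file and misses the older log files, while the merged view carries 
the full data for the file group.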

The other workaround would be to exclude file slices in pending compaction 
when we select small files, which avoids the interaction between the compactor 
and ingestion in this case. But I think we can go with the first option.
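
For reference, the second option could look roughly like the sketch below. 
It is not the actual partitioner code: SmallFileFilterSketch, 
smallFileCandidates, and the map-of-sizes input are hypothetical names chosen 
for illustration, and the only behavior assumed is "skip any file ID that has 
a pending compaction when picking small files".

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; NOT part of Hudi.
class SmallFileFilterSketch {

    // Returns the IDs of files below the size limit, skipping any file that
    // belongs to a file group with a pending compaction. Sorted for stable output.
    static List<String> smallFileCandidates(Map<String, Long> fileSizesById,
                                            Set<String> pendingCompactionFileIds,
                                            long smallFileLimitBytes) {
        return fileSizesById.entrySet().stream()
            .filter(e -> e.getValue() < smallFileLimitBytes)
            .filter(e -> !pendingCompactionFileIds.contains(e.getKey()))
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }
}
```

With this filter, ingestion never routes new inserts to a file group the 
compactor is about to rewrite, at the cost of leaving some small files 
untouched until the compaction completes.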

 

More details can be found here -> https://github.com/apache/hudi/issues/2633



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
