Nishith Agarwal created HUDI-1800:
-------------------------------------
Summary: Incorrect HoodieTableFileSystem API usage for pending
slices causing issues
Key: HUDI-1800
URL: https://issues.apache.org/jira/browse/HUDI-1800
Project: Apache Hudi
Issue Type: Bug
Components: Writer Core
Reporter: Nishith Agarwal
From [~vbalaji]:
We are using the wrong FileSystemView API here:
[https://github.com/apache/hudi/blob/release-0.6.0/hudi-client/src/main/java/org/apache/hudi/table/action/deltacommit/UpsertDeltaCommitPartitioner.java#L85]
Normally we don't include file groups that are in pending compaction, but with
the HBase index we are including them. With the current state of the code,
including files in pending compaction is an issue.
The API "getLatestFileSlicesBeforeOrOn" was originally intended to be used by
CompactionAdminClient to figure out which log files were added after a pending
compaction and rename them, so that we can undo the effects of compaction
scheduling. There is a different API, "getLatestMergedFileSlicesBeforeOrOn",
which gives a consolidated view of the latest file slice and includes all data
from both before and after the compaction. This is what should be used in
[https://github.com/apache/hudi/blob/release-0.6.0/hudi-client/src/main/java/org/apache/hudi/table/action/deltacommit/UpsertDeltaCommitPartitioner.java#L85]
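To illustrate the difference between the two views, here is a rough,
self-contained sketch. It does not use the Hudi classes: Slice, latest and
latestMerged are hypothetical stand-ins, and the merge rule (previous slice's
base file and log files plus the log files written after the compaction was
scheduled) is my assumption of what a "merged" slice looks like, based on the
description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergedSliceSketch {

    /** Toy stand-in for a file slice: an optional base file plus its log files. */
    static final class Slice {
        final String baseInstant;
        final String baseFile;        // null while no base file has been written
        final List<String> logFiles;

        Slice(String baseInstant, String baseFile, List<String> logFiles) {
            this.baseInstant = baseInstant;
            this.baseFile = baseFile;
            this.logFiles = logFiles;
        }
    }

    /** Un-merged view: simply the newest slice of the file group. */
    static Slice latest(List<Slice> slicesOldestFirst) {
        return slicesOldestFirst.get(slicesOldestFirst.size() - 1);
    }

    /**
     * Merged view (assumed rule): if the newest slice belongs to a pending
     * compaction and has no base file yet, fold it into the previous slice.
     */
    static Slice latestMerged(List<Slice> slicesOldestFirst, String pendingCompactionInstant) {
        int n = slicesOldestFirst.size();
        Slice last = slicesOldestFirst.get(n - 1);
        boolean pending = n > 1
            && last.baseFile == null
            && last.baseInstant.equals(pendingCompactionInstant);
        if (!pending) {
            return last;
        }
        Slice previous = slicesOldestFirst.get(n - 2);
        List<String> logs = new ArrayList<>(previous.logFiles);
        logs.addAll(last.logFiles);
        return new Slice(previous.baseInstant, previous.baseFile, logs);
    }

    public static void main(String[] args) {
        // Compaction was scheduled at instant 002 and has not run yet.
        List<Slice> fileGroup = Arrays.asList(
            new Slice("001", "base_001.parquet", Arrays.asList("log_001.1")),
            new Slice("002", null, Arrays.asList("log_002.1")));

        Slice plain = latest(fileGroup);
        Slice merged = latestMerged(fileGroup, "002");
        System.out.println("latest: base=" + plain.baseFile + " logs=" + plain.logFiles);
        System.out.println("merged: base=" + merged.baseFile + " logs=" + merged.logFiles);
    }
}
```

In this toy model the un-merged view sees only the log file written after
scheduling, while the merged view still carries the older base file and log
file, which is the data a small-file decision would need to account for.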
The other workaround would be to exclude file slices in pending compaction when
we select small files, which avoids the interaction between the compactor and
ingestion in this case. But I think we can go with the first option.
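For comparison, the workaround could look roughly like the sketch below. Again
this is not Hudi code: Candidate, smallFileIds and the size limit are
hypothetical names, and the set of file ids under pending compaction is assumed
to be available from wherever the compaction plans are tracked.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SmallFileFilterSketch {

    /** Toy stand-in for a small-file candidate: a file group id and its size. */
    static final class Candidate {
        final String fileId;
        final long sizeBytes;

        Candidate(String fileId, long sizeBytes) {
            this.fileId = fileId;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Keep files under the size limit, skipping any file group in pending compaction. */
    static List<String> smallFileIds(List<Candidate> candidates,
                                     Set<String> pendingCompactionFileIds,
                                     long smallFileLimitBytes) {
        return candidates.stream()
            .filter(c -> c.sizeBytes < smallFileLimitBytes)
            .filter(c -> !pendingCompactionFileIds.contains(c.fileId))
            .map(c -> c.fileId)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
            new Candidate("fg-1", 10L),
            new Candidate("fg-2", 20L),     // small, but compaction is pending
            new Candidate("fg-3", 5_000L)); // not small
        Set<String> pending = new HashSet<>(Arrays.asList("fg-2"));

        System.out.println(smallFileIds(candidates, pending, 100L));
    }
}
```

This keeps ingestion away from file groups the compactor is about to rewrite,
at the cost of not topping up those files until the compaction completes.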
More details can be found here: https://github.com/apache/hudi/issues/2633
--
This message was sent by Atlassian Jira
(v8.3.4#803005)