bvaradar edited a comment on issue #2633: URL: https://github.com/apache/hudi/issues/2633#issuecomment-810709098
@umehrot2 @n3nash @nsivabalan: My apologies for the delay; I finally got a chance to look into this. Yes, this will only manifest when the index can support log files.

I believe this is the problem: we are using the wrong FileSystemView API here: https://github.com/apache/hudi/blob/release-0.6.0/hudi-client/src/main/java/org/apache/hudi/table/action/deltacommit/UpsertDeltaCommitPartitioner.java#L85

Normally we don't include file groups that are in pending compaction, but with the HBase index we are including them. With the current state of the code, including files in pending compaction is an issue.

The API "getLatestFileSlicesBeforeOrOn" was originally intended for CompactionAdminClient, which uses it to figure out which log files were added after a pending compaction and to rename them so that the effects of compaction scheduling can be undone. There is a different API, "getLatestMergedFileSlicesBeforeOrOn", which gives a consolidated view of the latest file slice and includes all data both before and after compaction. This is what should be used at the UpsertDeltaCommitPartitioner line linked above.

The other workaround would be to exclude file slices in pending compaction when we select small files, which avoids the interaction between the compactor and ingestion in this case. But I think we can go with the first option.
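
To make the difference between the two views concrete, here is a toy model in plain Java. It is only a sketch of the behaviour described above, not Hudi code: `ToySlice`, `ToyFileGroup`, `latestUnmergedSlice`, `latestMergedSlice`, and the file names are all hypothetical, and it assumes that scheduling a compaction opens a new slice with no base file, so that later log files land in that new slice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class MergedViewDemo {

    // Toy stand-in for a file slice; hypothetical, not Hudi's FileSlice class.
    static final class ToySlice {
        final String baseInstant;
        final String baseFile;          // null while compaction has not written it yet
        final List<String> logFiles;

        ToySlice(String baseInstant, String baseFile, List<String> logFiles) {
            this.baseInstant = baseInstant;
            this.baseFile = baseFile;
            this.logFiles = Collections.unmodifiableList(new ArrayList<>(logFiles));
        }

        @Override
        public String toString() {
            return "slice@" + baseInstant + " base=" + baseFile + " logs=" + logFiles;
        }
    }

    // Toy file group with exactly two slices: one before and one after compaction scheduling.
    static final class ToyFileGroup {
        final ToySlice beforeCompaction;   // slice the pending compaction will rewrite
        final ToySlice pendingCompaction;  // slice opened when compaction was scheduled

        ToyFileGroup(ToySlice beforeCompaction, ToySlice pendingCompaction) {
            this.beforeCompaction = beforeCompaction;
            this.pendingCompaction = pendingCompaction;
        }

        // Unmerged view: only the newest slice, i.e. just the log files
        // written after compaction was scheduled.
        ToySlice latestUnmergedSlice() {
            return pendingCompaction;
        }

        // Merged view: base file of the older slice plus the log files of both slices.
        ToySlice latestMergedSlice() {
            List<String> logs = new ArrayList<>(beforeCompaction.logFiles);
            logs.addAll(pendingCompaction.logFiles);
            return new ToySlice(beforeCompaction.baseInstant, beforeCompaction.baseFile, logs);
        }
    }

    // Builds a sample file group; instants and file names are made up.
    static ToyFileGroup sampleGroup() {
        ToySlice before = new ToySlice("001", "base_001.parquet", Arrays.asList("log.1", "log.2"));
        ToySlice pending = new ToySlice("002", null, Arrays.asList("log.3"));
        return new ToyFileGroup(before, pending);
    }

    public static void main(String[] args) {
        ToyFileGroup group = sampleGroup();
        System.out.println("unmerged: " + group.latestUnmergedSlice());
        System.out.println("merged:   " + group.latestMergedSlice());
    }
}
```

In this model the unmerged view hides the base file and the earlier log files, so anything that sizes or routes records from it sees only part of the file group, while the merged view accounts for everything.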
