voonhous commented on code in PR #7997:
URL: https://github.com/apache/hudi/pull/7997#discussion_r1113959105


##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java:
##########
@@ -122,7 +129,24 @@ public HoodieFileGroupId getFileGroupId() {
    * some log files, that are based off a commit or delta commit.
    */
   private boolean isFileSliceCommitted(FileSlice slice) {
-    if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
+    if (compareTimestamps(slice.getBaseInstantTime(), GREATER_THAN, lastInstant.get().getTimestamp())) {
+      return false;
+    }
+
+    if (!slice.getBaseFile().isPresent() && timeline.isBeforeTimelineStarts(slice.getBaseInstantTime())) {

Review Comment:
   @bvaradar Reverted the change, please take a look. 
   
   Understood. The crux of the issue that I am trying to fix is somewhat 
similar to a multi-writer scenario. 
   
   In Flink, the job manager (JM) is responsible for performing a rollback, 
while the task manager (TM) is responsible for performing writes. 
   
   It is entirely possible, and very common, for a TM to perform writes before 
the JM performs a rollback (see https://issues.apache.org/jira/browse/HUDI-5822) 
when a job is recovering and restarting.
   
   Under the bucket index use-case, a bucketId can only have 1 fileGroup. Using 
`getLatestFileSlices` when JM has yet to complete a rollback will cause the 
fileGroup that is pending rollback completion to not be visible to the TM.
   
   TM will hence generate a new fileGroup for the same bucketId. Once the 
rollback completes, a partition might have a bucketId that maps to two 
fileGroups, breaking the 1 bucketId <> 1 fileGroup mapping contract.
   
   This is what #5185 was trying to fix: it allows fileGroups pending rollback 
to be re-used when performing a bucketIndex bootstrap. 
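   
   To make the race concrete, here is a toy model of the scenario. It is only a sketch: `BucketMappingSketch`, `ToyFileGroup`, `bootstrap` and the `includePendingRollback` flag are hypothetical names made up for illustration, not Hudi classes or APIs, and the model assumes the fileGroup pending rollback is still on storage when the TM writes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types are Hudi classes. */
final class BucketMappingSketch {

  /** Hypothetical stand-in for a file group. */
  static final class ToyFileGroup {
    final String fileId;
    final int bucketId;
    final boolean committed; // false while the owning instant is pending rollback

    ToyFileGroup(String fileId, int bucketId, boolean committed) {
      this.fileId = fileId;
      this.bucketId = bucketId;
      this.committed = committed;
    }
  }

  /**
   * Models a TM bootstrapping its bucket index for one partition.
   * With includePendingRollback == false, only committed file groups are
   * visible, which is the "latest file slices" style view in this toy model.
   */
  static Map<Integer, String> bootstrap(List<ToyFileGroup> onStorage,
                                        boolean includePendingRollback) {
    Map<Integer, String> bucketToFileId = new HashMap<>();
    for (ToyFileGroup fg : onStorage) {
      if (fg.committed || includePendingRollback) {
        bucketToFileId.put(fg.bucketId, fg.fileId);
      }
    }
    return bucketToFileId;
  }

  /**
   * Simulates one TM write to bucketId before the JM has finished the
   * rollback, and returns how many file groups the bucket ends up with.
   */
  static int fileGroupsAfterWrite(int bucketId, boolean includePendingRollback) {
    List<ToyFileGroup> onStorage = new ArrayList<>();
    // Left behind by the failed attempt; the JM has not rolled it back yet.
    onStorage.add(new ToyFileGroup("fg-old", bucketId, false));

    Map<Integer, String> index = bootstrap(onStorage, includePendingRollback);
    if (!index.containsKey(bucketId)) {
      // The TM sees no file group for this bucket, so it creates a new one.
      onStorage.add(new ToyFileGroup("fg-new", bucketId, false));
    }

    int count = 0;
    for (ToyFileGroup fg : onStorage) {
      if (fg.bucketId == bucketId) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println("committed-only view: " + fileGroupsAfterWrite(3, false));
    System.out.println("incl. pending rollback: " + fileGroupsAfterWrite(3, true));
  }
}
```

   With the committed-only view, the bucket ends up with two fileGroups; when fileGroups pending rollback are also considered during bootstrap, the existing one is re-used and the 1 bucketId <> 1 fileGroup mapping holds.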
   
   I found that fix a tad hacky, which is why I tried modifying the 
lower-level APIs to address such scenarios. As can be seen, I gave up on that 
shortly afterwards because of the performance penalties that you highlighted.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
