voonhous commented on PR #7891:
URL: https://github.com/apache/hudi/pull/7891#issuecomment-1457575576

   Apologies for disturbing everyone here. I've tried reaching out to the 
author on Slack, but failed to get an explanation, so I will post my questions 
here.
   
   # 1. Rationale behind the fix?
   From what @SteNicholas shared, this PR fixes a data-duplication issue after 
clustering? Since this is not documented anywhere in the PR or the JIRA ticket, 
I think it is best to mention it in the PR itself for future reference.
   
   # 2. Issues with the integration test?
   I apologize again, as I don't quite understand the reason behind this fix. 
Hence, while trying to reproduce the issue, I copied the integration test 
(`testHoodieFlinkClusteringScheduleAfterArchive`) onto the 
`ITTestHoodieFlinkClustering` of a Hudi version that doesn't have this fix yet, 
expecting the IT to fail. However, it still succeeds without the fix. Hence, 
either the test is not written properly, or the fix is not valid to begin with, 
as the test passes under ANY conditions. 
   
   I copied the test onto 6b178ce978ce324361cf708ce711cd0c46452dd6, the commit 
immediately prior to the fix being merged into master.
   
   I repeated the test 100 times, and it passed all 100 times on a branch 
without the fix.
   
   
![image](https://user-images.githubusercontent.com/6312314/223329883-05e6a499-72f7-4950-b723-db6468a5e68e.png)
   
   
   # 3. Questions regarding the fix
   
   Can someone please help address the concerns above, mainly (2) and (3).
   > i.e, timeline: deltacommit1, deltacommit2, replacecommit1.inflight, 
deltacommit3, deltacommit4', at this time, if deltacommit1, deltacommit2 are 
archived, the containsOrBeforeTimelineStarts returns true for 
replacecommit1.inflight because isBeforeTimelineStarts returns true. Then, the 
file slices of the pending clustering instant would be included in another 
clustering plan.
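   
   My reading of the quoted scenario can be sketched with a toy check below. 
This is NOT the actual Hudi implementation; it only assumes (my assumption) 
that the check compares an instant's timestamp against the first completed 
instant remaining in the active timeline after archiving, with plain strings 
standing in for instant timestamps:
   
   ```java
   import java.util.List;
   import java.util.Optional;
   
   // Toy model of containsOrBeforeTimelineStarts after archiving; not Hudi code.
   public class ArchiveCheckSketch {
   
     static boolean containsOrBeforeTimelineStarts(List<String> completed, String ts) {
       // First completed instant left on the active timeline after archiving.
       Optional<String> firstActive = completed.stream().sorted().findFirst();
       boolean contains = completed.contains(ts);
       // An instant that sorts before the timeline start is treated as archived.
       boolean beforeStart = firstActive.isPresent() && ts.compareTo(firstActive.get()) < 0;
       return contains || beforeStart;
     }
   
     public static void main(String[] args) {
       // deltacommit1 ("001") and deltacommit2 ("002") have been archived, so
       // the completed active timeline only holds deltacommit3/deltacommit4.
       List<String> completed = List.of("004", "005");
       // replacecommit1.inflight ("003") sorts before the timeline start, so
       // the check reports true even though it was never committed.
       System.out.println(containsOrBeforeTimelineStarts(completed, "003")); // true
     }
   }
   ```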
   
   IIUC, the (input) file slices of the pending clustering instant will be 
excluded from the second clustering plan, as the clustering plan generator 
excludes all file slices involved in a pending:
   
   1. logCompaction
   2. compaction
   3. clustering
   
   Code handling this in master 
(`org.apache.hudi.table.action.cluster.strategy.ClusteringPlanStrategy#getFileSlicesEligibleForClustering`):
   
   ```java
   /**
    * Return file slices eligible for clustering. FileIds in pending
    * clustering/compaction are not eligible for clustering.
    */
   protected Stream<FileSlice> getFileSlicesEligibleForClustering(String partition) {
     SyncableFileSystemView fileSystemView = (SyncableFileSystemView) getHoodieTable().getSliceView();
     Set<HoodieFileGroupId> fgIdsInPendingCompactionLogCompactionAndClustering =
         Stream.concat(fileSystemView.getPendingCompactionOperations(), fileSystemView.getPendingLogCompactionOperations())
             .map(instantTimeOpPair -> instantTimeOpPair.getValue().getFileGroupId())
             .collect(Collectors.toSet());
     fgIdsInPendingCompactionLogCompactionAndClustering.addAll(
         fileSystemView.getFileGroupsInPendingClustering().map(Pair::getKey).collect(Collectors.toSet()));
   
     return hoodieTable.getSliceView().getLatestFileSlices(partition)
         // file ids already in clustering are not eligible
         .filter(slice -> !fgIdsInPendingCompactionLogCompactionAndClustering.contains(slice.getFileGroupId()));
   }
   ```
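   
   As a minimal, self-contained illustration of the exclusion above (plain 
strings standing in for `HoodieFileGroupId`; this is a simplified sketch, not 
Hudi code):
   
   ```java
   import java.util.List;
   import java.util.Set;
   import java.util.stream.Collectors;
   
   // Simplified stand-in for getFileSlicesEligibleForClustering: file groups
   // already referenced by a pending compaction/log-compaction/clustering
   // operation are filtered out before a new plan is built.
   public class EligibilitySketch {
   
     static List<String> eligible(List<String> latestFileGroups, Set<String> pendingFgIds) {
       return latestFileGroups.stream()
           .filter(fgId -> !pendingFgIds.contains(fgId))
           .collect(Collectors.toList());
     }
   
     public static void main(String[] args) {
       // FG_0..FG_2 are inputs of the pending replacecommit, so a second
       // clustering plan should only consider FG_3 and FG_4.
       Set<String> pending = Set.of("FG_0", "FG_1", "FG_2");
       List<String> latest = List.of("FG_0", "FG_1", "FG_2", "FG_3", "FG_4");
       System.out.println(eligible(latest, pending)); // [FG_3, FG_4]
     }
   }
   ```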
   
   > When inline or async clustering is enabled, we need to ensure that there 
is a commit in the active timeline to check whether the file slice generated in 
pending clustering after archive isn't committed via 
HoodieFileGroup#isFileSliceCommitted(slice).
   
   IIUC, for a `fileSlice` to be a candidate for a clustering 
`inputFileGroup`, it needs to be committed, no? So, even though the 
`replaceCommit` is the first instant on the timeline, the `inputFileGroups` 
should be tagged with `HoodieFileGroup#isFileSliceCommitted(slice) = true`.  
   
   # 4. My understanding of the fix
   
   From what I understand (please correct me if I am wrong), this fix is 
trying to address an issue where the same file slice is included in the 
`inputFileGroup` of 2 separate clusteringPlans/replaceCommits. After 
clustering, the same row will exist in 2 different fileGroups, as clustering 
creates new fileGroups for clustered data.
   
   Reference: A replaceCommit does the following:
   ```
   replaceCommit1 = input{FG_0, FG_1, FG_2} -> output{FG_3}
   replaceCommit2 = input{FG_0, FG_1, FG_2} -> output{FG_4}
   ```
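   
   To make the duplication above concrete, here is a toy model (my own 
hypothetical sketch, not Hudi code) counting the live copies of a file group's 
rows when both replace commits share the same inputs:
   
   ```java
   import java.util.List;
   import java.util.Set;
   
   // Toy model: each replacecommit rewrites the rows of its input file groups
   // into a new output file group. If two commits share the same inputs, every
   // row ends up in two live output file groups after both commits complete.
   public class DuplicationSketch {
   
     static long liveCopies(List<Set<String>> replaceCommitInputs, String fileGroup) {
       // One rewritten copy of the file group's rows per plan that includes it.
       return replaceCommitInputs.stream()
           .filter(inputs -> inputs.contains(fileGroup))
           .count();
     }
   
     public static void main(String[] args) {
       Set<String> inputs = Set.of("FG_0", "FG_1", "FG_2");
       // replacecommit1 -> output FG_3, replacecommit2 -> output FG_4,
       // both built from the same input file groups.
       System.out.println(liveCopies(List.of(inputs, inputs), "FG_0")); // 2
     }
   }
   ```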
   
   If this is occurring, this seems like an issue with how Hudi is fetching 
`pendingClusteringPlans`, no?

