gavinchou commented on PR #64167:
URL: https://github.com/apache/doris/pull/64167#issuecomment-4849868512

   Robustness issue in 
`TenantLevelColocateTableCheckerAndBalancer#matchGroups()`: one bad group can 
abort the whole checker round.
   
   `matchGroups()` iterates all tenant-level colocate groups without per-group 
exception isolation. Some deeper checks use `Preconditions.checkState(...)`, 
for example when matching backend bucket sequences and tablet counts. If one 
group has inconsistent metadata or tablets and such a check throws, the 
exception propagates out of the loop, so later groups are not checked or 
repaired in this round. If the bad group stays inconsistent, it can keep 
blocking the other groups in every subsequent round.
   
   Relevant code paths:
   - group loop without per-group try/catch: 
https://github.com/apache/doris/blob/cde59482ce5a548a2652c3aead57096a9c832f22/fe/fe-core/src/main/java/org/apache/doris/clone/TenantLevelColocateTableCheckerAndBalancer.java#L186-L221
   - `Preconditions` inside matching: 
https://github.com/apache/doris/blob/cde59482ce5a548a2652c3aead57096a9c832f22/fe/fe-core/src/main/java/org/apache/doris/clone/TenantLevelColocateTableCheckerAndBalancer.java#L301-L323
   
   Can we isolate failures per group, log/mark only that group as unstable, and 
continue checking the remaining groups? That would also match the tenant-level 
isolation goal of this feature.
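
   A minimal sketch of the per-group isolation suggested above. All names here 
(`GroupIsolationSketch`, the `matchOneGroup` callback, string group ids) are 
hypothetical stand-ins and not the actual Doris code. It assumes the per-group 
work can be wrapped in one call and that failures surface as unchecked 
exceptions, which holds for `Preconditions.checkState(...)` since it throws 
`IllegalStateException`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class GroupIsolationSketch {

    /**
     * Runs matchOneGroup for every group and returns the ids of groups whose
     * matching threw. One failing group does not stop the remaining groups.
     * matchOneGroup is a hypothetical stand-in for the per-group matching logic.
     */
    public static List<String> matchGroups(List<String> groupIds,
                                           Consumer<String> matchOneGroup) {
        List<String> unstable = new ArrayList<>();
        for (String groupId : groupIds) {
            try {
                matchOneGroup.accept(groupId);
            } catch (RuntimeException e) {
                // Isolate the failure: log it, remember only this group as
                // unstable, and continue with the next group.
                System.err.println("colocate group " + groupId
                        + " failed matching, marking unstable: " + e);
                unstable.add(groupId);
            }
        }
        return unstable;
    }

    public static void main(String[] args) {
        List<String> checked = new ArrayList<>();
        List<String> unstable = matchGroups(List.of("g1", "g2", "g3"), id -> {
            if (id.equals("g2")) {
                // Simulates a failed Preconditions.checkState(...) in one group.
                throw new IllegalStateException("tablet count mismatch");
            }
            checked.add(id);
        });
        // prints: checked=[g1, g3] unstable=[g2]
        System.out.println("checked=" + checked + " unstable=" + unstable);
    }
}
```

   In the real method the catch block would presumably mark the group unstable 
through whatever mechanism the checker already uses. The point of the sketch is 
only that the try/catch sits inside the loop rather than around it.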


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
