shlomitubul opened a new pull request, #3614:
URL: https://github.com/apache/celeborn/pull/3614

   ### What changes were proposed in this pull request?
   `getTaggedWorkers()` obtains a direct reference to the cached Set from 
`getWorkersWithTag()` and then calls `retainAll()` on it to intersect it with 
the worker sets of the other tags and with the available workers. Since 
`retainAll()` mutates the Set in place, this corrupts the cached entry for as 
long as it stays cached. When multiple applications with different tag 
combinations share the same master, one app's intersection shrinks the cached 
Set, so subsequent lookups by other apps find fewer workers or none at all. 
Once the cached Set has been reduced to empty, all future slot requests fail 
with WORKER_EXCLUDED until the cache is refreshed.
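
The aliasing problem described above can be illustrated with a minimal, self-contained sketch. This is not the Celeborn code: the class `TagCacheSketch`, the string worker IDs, the tag names, and the method signatures are all hypothetical stand-ins, and only the method names `getWorkersWithTag` and `retainAll` come from the description. It assumes the natural remedy is a defensive copy of the cached Set before intersecting.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the master's tag -> workers cache (not Celeborn's real class).
class TagCacheSketch {
    private final Map<String, Set<String>> cache = new HashMap<>();

    TagCacheSketch() {
        // Illustrative tags and worker IDs only.
        cache.put("gpu", new HashSet<>(Arrays.asList("w1", "w2", "w3")));
        cache.put("ssd", new HashSet<>(Arrays.asList("w2", "w3", "w4")));
    }

    // Returns the cached Set itself, not a copy.
    Set<String> getWorkersWithTag(String tag) {
        Set<String> workers = cache.get(tag);
        return workers != null ? workers : new HashSet<>();
    }

    // Buggy pattern: the result aliases the cached Set, so retainAll() shrinks the cache.
    Set<String> taggedWorkersBuggy(Set<String> available, String first, String... rest) {
        Set<String> result = getWorkersWithTag(first);
        for (String tag : rest) {
            result.retainAll(getWorkersWithTag(tag));
        }
        result.retainAll(available);
        return result;
    }

    // Assumed fix: copy the cached Set first, then intersect on the copy.
    Set<String> taggedWorkersFixed(Set<String> available, String first, String... rest) {
        Set<String> result = new HashSet<>(getWorkersWithTag(first));
        for (String tag : rest) {
            result.retainAll(getWorkersWithTag(tag));
        }
        result.retainAll(available);
        return result;
    }
}
```

In this sketch, calling `taggedWorkersBuggy(available, "gpu", "ssd")` leaves the cached "gpu" entry holding only the intersection, so a later lookup for "gpu" alone sees fewer workers. The copying variant returns the same intersection while leaving the cached entry unchanged.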
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR resolve a correctness bug?
   Yes
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Custom image in my dev environment, plus a local test.
   


