keith-turner commented on issue #5047:
URL: https://github.com/apache/accumulo/issues/5047#issuecomment-2518618654

   While examining jstacks of a running manager, I saw lots of threads stuck in 
ZooCache waiting to resolve whether a tablet server currently held its lock.  
They were getting stuck in ZooCache because it was continually having to go to 
ZooKeeper.  There were only a few tablet servers, so it was not clear why they 
were getting stuck. After applying the change in #5133, I was able to see why 
ZooCache was continually missing: it was caused by this issue combined with 
there being multiple tserver resource groups.  This problem results in every 
scan or write to Accumulo hitting ZooKeeper.  The following is what happened.
   
   
    1. A scan or write does a check using ZooCache to see if the tablet server 
is currently holding its lock in ZooKeeper.
    2. When there are multiple resource groups, the check looks for the 
tserver in each resource group by calling 
`zooCache.getChildren("<zkroot>/tservers/<resource group>/<tserver>")`. 
    3. The tserver only exists in one RG. For the RGs where it does not exist, 
getChildren does not cache non-existence, so every such call falls back to 
ZooKeeper.
    4. While the getChildren call is being made to ZooKeeper, all other 
threads asking for the same path block.
   
   If there were only a single resource group for tablet servers, this problem 
would not be seen.
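
   The effect of steps 2 and 3 can be illustrated with a small, self-contained 
sketch. This is not Accumulo's ZooCache code: `FakeZooKeeper`, `ToyCache`, the 
resource group names, the tserver address and the lock node name are all 
hypothetical stand-ins, and the blocking described in step 4 is not modeled. 
It only counts how many lookups reach the backing store when non-existence is 
and is not cached.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class NegativeCacheSketch {

  // Hypothetical stand-in for ZooKeeper: maps a path to its children and
  // counts every lookup that actually reaches it.
  static class FakeZooKeeper {
    final Map<String, List<String>> nodes = new HashMap<>();
    int reads = 0;

    List<String> getChildren(String path) {
      reads++;
      return nodes.get(path); // null means the node does not exist
    }
  }

  // Hypothetical toy cache. When cacheNonExistence is false, a missing node
  // is never remembered, so every lookup for it goes to the backing store.
  static class ToyCache {
    final FakeZooKeeper zk;
    final boolean cacheNonExistence;
    final Map<String, Optional<List<String>>> cache = new HashMap<>();

    ToyCache(FakeZooKeeper zk, boolean cacheNonExistence) {
      this.zk = zk;
      this.cacheNonExistence = cacheNonExistence;
    }

    List<String> getChildren(String path) {
      Optional<List<String>> cached = cache.get(path);
      if (cached != null) {
        return cached.orElse(null); // cache hit, possibly a cached "absent"
      }
      List<String> children = zk.getChildren(path);
      if (children != null || cacheNonExistence) {
        cache.put(path, Optional.ofNullable(children));
      }
      return children;
    }
  }

  // Runs `checks` lock checks for one tserver across three resource groups
  // and returns how many lookups reached the fake ZooKeeper.
  static int readsFor(boolean cacheNonExistence, int checks) {
    FakeZooKeeper zk = new FakeZooKeeper();
    // The tserver exists in exactly one resource group (assumed names).
    zk.nodes.put("/tservers/rg2/ts1:9997", List.of("zlock#0001"));
    ToyCache cache = new ToyCache(zk, cacheNonExistence);

    String[] resourceGroups = {"rg1", "rg2", "rg3"};
    for (int i = 0; i < checks; i++) {
      for (String rg : resourceGroups) {
        cache.getChildren("/tservers/" + rg + "/ts1:9997");
      }
    }
    return zk.reads;
  }

  public static void main(String[] args) {
    System.out.println("reads without caching non-existence: " + readsFor(false, 100));
    System.out.println("reads with caching non-existence:    " + readsFor(true, 100));
  }
}
```

   In this sketch the one existing path is read once and then served from the 
cache, while each of the two missing paths is read on every check unless 
non-existence is cached. With a single resource group there would be no 
missing paths to look up, which matches the observation above.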
   
   

