EdColeman commented on issue #3138: URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1410935644
Another dimension to this might occur if the dead server has just enough functionality to keep the ZooKeeper connection from timing out but is otherwise unable to fully receive or respond to ZooKeeper events. What would "happen" is that the zoo lock is deleted, which should force the tserver to stop hosting its tablets. The manager sees the tablets as unassigned and assigns them to another tserver. If the original tserver does not realize that it should no longer be hosting the tablets, then both the original and the new server are serving the same tablets, which we assume will never happen. There is an IT, HalfDeadTServerIT, that tries to test some of this, but I am not sure how much it actually covers. I also recall past attempts to mock / wrap a ZooKeeper client to inject various errors, but I am unsure how far they progressed. Most of this may be outside of this issue (if the Fate command is insufficient), but there may be other issues that should be looked at.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
