keith-turner commented on issue #1460:
URL: https://github.com/apache/accumulo/issues/1460#issuecomment-769276457
Looking into this code a bit, below are some potential problems.
1. When there is a problem connecting to a tserver, the manager [tries to connect to the tserver to ask it to halt](https://github.com/apache/accumulo/blob/6af856b38bb57f49418547a8b9c262c66a1fe31e/server/manager/src/main/java/org/apache/accumulo/master/Master.java#L931-L934).
2. Since the code for dealing with potentially bad tservers is in the code path for gathering stats, it can impede getting status from functioning tservers.
3. The code [uses a hard coded count](https://github.com/apache/accumulo/blob/6af856b38bb57f49418547a8b9c262c66a1fe31e/server/manager/src/main/java/org/apache/accumulo/master/Master.java#L170) of failed connection attempts to decide whether a tserver is bad.
4. The code has no concept of the big picture, e.g. whether 1 of 1000 tservers is bad vs 999 of 1000. Each bad tserver is handled independently. The rate limiting changes made in #1456 dealt with this to some extent.
I think there is value in what the code is attempting to do (react to tservers that the manager cannot connect to), but it could be improved. Accumulo tservers must maintain a heartbeat with a ZK server in order to be considered alive. There are situations where a tserver is maintaining this ZK heartbeat, but new connections cannot be made to it. When the manager attempts to get status from a tserver, it does not currently use a pooled connection. If the manager cannot form a new connection, then clients likely cannot either.
One possible improvement that could address the problems above is an internal service in the manager dedicated to dealing with bad tservers. The code that gathers status would queue tservers it had trouble connecting to onto this service. The service would decide when to react and would have its own threads for reacting. The service can consider the big picture, and queuing something for it to process does not need to block gathering status.
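As a rough sketch of what such a service might look like (all class and method names here are hypothetical, not existing Accumulo code), assuming the status gatherer simply hands off suspect tservers and a background thread decides when to react:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical internal manager service for handling suspected-bad tservers.
public class BadTserverService implements Runnable {

  private final BlockingQueue<String> suspectTservers = new LinkedBlockingQueue<>();
  private final Map<String,Integer> failureCounts = new ConcurrentHashMap<>();
  private final int totalTservers;

  public BadTserverService(int totalTservers) {
    this.totalTservers = totalTservers;
  }

  // Called by the status-gathering code; only enqueues, never blocks.
  public void reportConnectionFailure(String tserver) {
    suspectTservers.offer(tserver);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        String tserver = suspectTservers.take();
        int failures = failureCounts.merge(tserver, 1, Integer::sum);
        // Big picture check: if most tservers are failing, the problem is
        // probably not the individual tservers, so hold off on reacting.
        if (failureCounts.size() > totalTservers / 2) {
          continue;
        }
        if (failures >= 3) { // threshold would be configurable, see below
          reactToBadTserver(tserver);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  private void reactToBadTserver(String tserver) {
    // e.g. delete its ZK lock and/or ask it to halt
  }
}
```

The key design point is that `reportConnectionFailure` only enqueues, so gathering status is never blocked, and all the reaction logic lives in one place where it can see how many tservers are currently failing.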
For problem 3, the hard coded count: this criterion for classifying a tserver as bad does not seem ideal. A configurable criterion based on both time and count seems better, e.g. classify a tserver as bad if at least Y attempts to contact it failed within an X minute period.
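A minimal sketch of that time-and-count criterion (class name and thresholds are illustrative; the real values would come from configuration rather than a hard coded constant):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Tracks failed contact attempts for a single tserver and classifies it as
// bad only if at least maxFailures failures occurred within windowMillis.
public class FailureWindow {

  private final Deque<Long> failureTimes = new ArrayDeque<>();
  private final int maxFailures;   // Y failed attempts ...
  private final long windowMillis; // ... within an X minute window

  public FailureWindow(int maxFailures, long windowMillis) {
    this.maxFailures = maxFailures;
    this.windowMillis = windowMillis;
  }

  public synchronized boolean recordFailureAndCheck(long nowMillis) {
    failureTimes.addLast(nowMillis);
    // Drop failures that have fallen outside the window.
    while (!failureTimes.isEmpty()
        && nowMillis - failureTimes.peekFirst() > windowMillis) {
      failureTimes.removeFirst();
    }
    return failureTimes.size() >= maxFailures;
  }
}
```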
For problem 1, asking the tserver to halt: if the manager cannot connect to the tserver to get status, then its chances of connecting to the tserver to ask it to halt are diminished. It may be better to delete the tserver's ZK lock in addition to, or instead of, asking it to halt.
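Deleting the ZK lock only requires a working ZooKeeper session on the manager's side, which is more likely to succeed than opening a new connection to an unresponsive tserver. A minimal sketch using the plain ZooKeeper client (the lock path is hypothetical; the real path comes from Accumulo's ZooKeeper layout):

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockRemover {

  // Removes a tserver's lock node so the tserver is no longer considered
  // alive. Callers would pass the lock node path for the specific tserver.
  public static void deleteTserverLock(ZooKeeper zk, String lockNodePath)
      throws KeeperException, InterruptedException {
    // version -1 matches any version of the node
    zk.delete(lockNodePath, -1);
  }
}
```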
Overall, I think there is value in what the code is attempting to do, but it needs to be improved.