keith-turner commented on issue #1460:
URL: https://github.com/apache/accumulo/issues/1460#issuecomment-769276457


   Looking into this code a bit, I see some potential problems, listed below.
   
    1. When there is a problem connecting to a tserver, the manager [tries to connect to that same tserver to ask it to halt](https://github.com/apache/accumulo/blob/6af856b38bb57f49418547a8b9c262c66a1fe31e/server/manager/src/main/java/org/apache/accumulo/master/Master.java#L931-L934).
    2. Since the code for dealing with potentially bad tservers is in the code path for getting stats, it can impede getting status from functioning tservers.
    3. The code [uses a hard-coded count](https://github.com/apache/accumulo/blob/6af856b38bb57f49418547a8b9c262c66a1fe31e/server/manager/src/main/java/org/apache/accumulo/master/Master.java#L170) of failed connection attempts to decide whether a tserver is bad.
    4. The code has no concept of the big picture, e.g. whether 1 of 1,000 tservers is bad or 999 of 1,000 are. Each bad tserver is handled independently. The rate-limiting changes made in #1456 dealt with this to some extent.
   
   I think there is value in what the code is attempting to do (react to tservers that the manager cannot connect to), but it could be improved. Accumulo tservers must maintain a heartbeat with a ZK server in order to be considered alive. There are situations where a tserver is maintaining this ZK heartbeat, but new connections cannot be made to the tserver. When the manager attempts to get status from a tserver, it does not currently use a pooled connection. If the manager cannot form a new connection, then clients likely cannot either.
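   
   To illustrate that last point, below is a minimal sketch of an unpooled connection attempt using the raw Thrift transport API. This is not Accumulo's actual status code and the class name is hypothetical; the point is just that a fresh socket is opened per attempt, so if `open()` fails for the manager, a client opening its own fresh connection would likely fail too.
   
   ```java
   import org.apache.thrift.transport.TSocket;
   import org.apache.thrift.transport.TTransport;
   import org.apache.thrift.transport.TTransportException;
   
   // Hypothetical probe, not Accumulo code: opens a brand new (non-pooled)
   // Thrift socket, the same kind of fresh connection a client would make.
   public class StatusConnectionProbe {
     public static boolean canConnect(String host, int port, int timeoutMillis) {
       TTransport transport = null;
       try {
         transport = new TSocket(host, port, timeoutMillis);
         transport.open(); // fails if no new connections can be made
         return true;
       } catch (TTransportException e) {
         return false;
       } finally {
         if (transport != null && transport.isOpen()) {
           transport.close();
         }
       }
     }
   }
   ```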
   
   One possible improvement that could address the problems above is an internal service in the manager dedicated to dealing with bad tservers. The code that gathers status would queue tservers it had trouble connecting to with this service. The service would decide when to react and would have its own threads for reacting. The service could consider the big picture, and queuing something for it to process would not block status gathering. A rough sketch of such a service is below.
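   
   Everything in this sketch is hypothetical, not existing Accumulo code. The key property is that `reportFailure` only enqueues, so the status-gathering thread never waits on the reaction logic, and the single reactor thread can look at cluster-wide state before acting.
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.LinkedBlockingQueue;
   
   // Hypothetical sketch only: the class, methods, and heuristic below are
   // illustrative, not existing Accumulo code.
   public class BadTServerService {
   
     private final LinkedBlockingQueue<String> failedServers = new LinkedBlockingQueue<>();
     private final Map<String,Integer> failureCounts = new ConcurrentHashMap<>();
     private final ExecutorService reactor = Executors.newSingleThreadExecutor();
     private final int totalServers;
   
     public BadTServerService(int totalServers) {
       this.totalServers = totalServers;
       reactor.execute(this::processFailures);
     }
   
     // Called from the status-gathering code; offer() on an unbounded queue
     // never blocks, so status gathering is not impeded.
     public void reportFailure(String tserver) {
       failedServers.offer(tserver);
     }
   
     private void processFailures() {
       while (!Thread.currentThread().isInterrupted()) {
         try {
           String tserver = failedServers.take();
           failureCounts.merge(tserver, 1, Integer::sum);
           // Big-picture check (hypothetical heuristic): if a large fraction
           // of the cluster looks bad, the problem is probably elsewhere
           // (e.g. the network), so do not react to individual tservers.
           if (failureCounts.size() > totalServers / 2) {
             continue;
           }
           react(tserver);
         } catch (InterruptedException e) {
           Thread.currentThread().interrupt();
         }
       }
     }
   
     private void react(String tserver) {
       // Placeholder for the reaction, e.g. deleting the tserver's ZK lock.
       System.out.println("Reacting to bad tserver: " + tserver);
     }
   }
   ```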
   
   For problem 3, the hard-coded count, this criterion for classifying a tserver as bad does not seem ideal. A configurable criterion based on both time and count would be better: for example, if at least Y attempts to contact a tserver failed within an X-minute period, classify it as bad. A sketch of such a check follows.
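   
   This is a sketch under assumptions: the class name is hypothetical, and X and Y would come from configuration. The check can be a sliding window of failure timestamps.
   
   ```java
   import java.util.ArrayDeque;
   import java.util.Deque;
   
   // Hypothetical sketch of a "Y failures in an X-minute window" criterion.
   public class FailureWindow {
   
     private final Deque<Long> failureTimes = new ArrayDeque<>();
     private final long windowMillis; // the X-minute period
     private final int maxFailures;   // the Y-attempts threshold
   
     public FailureWindow(long windowMillis, int maxFailures) {
       this.windowMillis = windowMillis;
       this.maxFailures = maxFailures;
     }
   
     // Record a failed connection attempt and return true if the tserver
     // should now be classified as bad.
     public synchronized boolean recordFailureAndCheck() {
       long now = System.currentTimeMillis();
       failureTimes.addLast(now);
       // Drop failures that fell outside the sliding window.
       while (!failureTimes.isEmpty() && now - failureTimes.peekFirst() > windowMillis) {
         failureTimes.removeFirst();
       }
       return failureTimes.size() >= maxFailures;
     }
   }
   ```
   
   For example, `new FailureWindow(10 * 60_000L, 5)` would classify a tserver as bad after 5 failed attempts within 10 minutes.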
   
   For problem 1, asking the tserver to halt: if the manager cannot connect to the tserver to get status, then its chances of connecting to the tserver to ask it to halt are diminished. It may be better to delete the tserver's ZK lock and/or ask it to halt, since deleting the lock only requires a connection to ZooKeeper, not to the unreachable tserver.
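   
   For reference, deleting a lock node with the raw ZooKeeper client looks like the sketch below. Accumulo's actual lock path layout and its ZooKeeper wrapper utilities differ, so this is illustrative only.
   
   ```java
   import org.apache.zookeeper.KeeperException;
   import org.apache.zookeeper.ZooKeeper;
   
   // Illustrative only: Accumulo wraps ZooKeeper access in its own utilities,
   // and the real lock path layout is not shown here.
   public class TServerLockRemover {
   
     private final ZooKeeper zk;
   
     public TServerLockRemover(ZooKeeper zk) {
       this.zk = zk;
     }
   
     // Deleting the lock node only requires a ZooKeeper connection, not a
     // connection to the unreachable tserver. Once the lock is gone, the
     // rest of the cluster treats the tserver as dead.
     public void deleteLock(String lockPath) throws KeeperException, InterruptedException {
       zk.delete(lockPath, -1); // version -1 means "any version"
     }
   }
   ```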
   
   Overall, I think there is value in what the code is attempting to do and that it needs to be improved.

