EdColeman commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1529943361

   I am not sure that just logging the error and then waiting for the next 
check is appropriate, and a retry and fail after X attempts may be a better 
approach.
   
   For the case that triggered this, it seems the error is transient, but what 
if it was not?  I think the check was to validate that the tserver can read the 
entire metadata table - if it cannot, it may not be "safe" to continue to run 
with incomplete metadata information.  Even if those operations "fail" because 
the metadata cannot be known to be consistent, if we allow the tserver to keep 
running, then it seems like work would continue to be assigned / attempted 
even though it will not work with the current tserver state.
   
   This would be similar to the issue where tservers could not host tablets, 
but the manager kept seeing that tserver as having 0 tablets and kept trying 
to make assignments to that "under-utilized" tserver. (The solution was to 
detect and remove the tserver lock to stop assignment attempts.)
   
   This change shifts the priority to "keeping the tserver up" rather than 
killing the process if it cannot fill a basic requirement of being able to 
maintain a consistent view of the metadata.  A retry may provide a balance of 
having a better chance to ride out transient errors, but still provide a hard 
fail if it cannot.
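   
   As a rough illustration of the "retry and fail after X attempts" idea, 
something along these lines could work.  This is only a sketch and not code 
from the PR: the class name, the `retry` helper, and the `maxAttempts` / 
`waitMillis` parameters are all hypothetical, and the metadata check itself is 
stood in for by a generic `Callable`.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Accumulo: retries a check a bounded
// number of times and then fails hard.
final class MetadataCheckRetry {

  // Runs the check up to maxAttempts times, sleeping waitMillis between
  // failed attempts. Returns the first successful result; if every attempt
  // throws, wraps the last failure in an IllegalStateException.
  static <T> T retry(Callable<T> check, int maxAttempts, long waitMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return check.call();
      } catch (InterruptedException e) {
        // Do not treat an interrupt as a transient error; pass it on.
        throw e;
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          // Give a transient error some time to clear before retrying.
          Thread.sleep(waitMillis);
        }
      }
    }
    throw new IllegalStateException(
        "check failed after " + maxAttempts + " attempts", last);
  }
}
```

   The caller would decide what the hard fail means (for example, halting the 
tserver) when the `IllegalStateException` surfaces; the sketch only bounds how 
long a transient error is tolerated.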


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
