EdColeman commented on PR #3366: URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1529943361
I am not sure that just logging the error and waiting for the next check is appropriate; retrying and failing after X attempts may be a better approach. For the case that triggered this, the error seems to be transient, but what if it were not?

I think the check was meant to validate that the tserver can read the entire metadata table. If it cannot, it may not be "safe" to continue running with incomplete metadata information. Even if those operations "fail" because the metadata cannot be known to be consistent, allowing the tserver to keep running means work would continue to be assigned or attempted even though it cannot succeed with the current tserver state.

This would be similar to the issue where tservers could not host tablets, but the manager kept seeing that tserver as having 0 tablets and kept trying to make assignments to that "under-utilized" tserver. (The solution was to detect this and remove the tserver lock to stop assignment attempts.)

This change shifts the priority to "keeping the tserver up" rather than killing the process if it cannot fulfill a basic requirement: being able to maintain a consistent view of the metadata. A retry may provide a balance: a better chance to ride out transient errors, but still a hard fail if it cannot.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
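
The "retry and fail after X attempts" idea in the comment above could be sketched roughly as follows. This is only an illustration, not code from the PR or from Accumulo: `RetryThenFail`, `Check`, `runWithRetries`, and `RetriesExhaustedException` are hypothetical names, and the attempt count and pause are placeholder parameters.

```java
/** Hypothetical helper illustrating "retry, then hard-fail"; not part of Accumulo. */
public final class RetryThenFail {

  /** A check that may throw; stands in for the metadata consistency check. */
  @FunctionalInterface
  public interface Check {
    void run() throws Exception;
  }

  /** Thrown once every attempt has failed, so the caller can hard-fail. */
  public static final class RetriesExhaustedException extends RuntimeException {
    public RetriesExhaustedException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  private RetryThenFail() {}

  /**
   * Runs the check up to maxAttempts times, pausing between failed attempts.
   * Returns on the first success; throws RetriesExhaustedException if the
   * final attempt also fails.
   */
  public static void runWithRetries(Check check, int maxAttempts, long pauseMillis)
      throws InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        check.run();
        return; // success: a transient error (if any) has been ridden out
      } catch (InterruptedException e) {
        throw e; // do not treat interruption as a retryable failure
      } catch (Exception e) {
        last = e;
        System.err.printf("check failed (attempt %d of %d): %s%n",
            attempt, maxAttempts, e);
      }
      if (attempt < maxAttempts) {
        Thread.sleep(pauseMillis);
      }
    }
    throw new RetriesExhaustedException(
        "check failed after " + maxAttempts + " attempts", last);
  }
}
```

Under these assumptions, the caller would catch `RetriesExhaustedException` and stop the process (or give up its lock) instead of only logging, which keeps the hard fail while tolerating transient errors.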
