EdColeman commented on PR #3366: URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1531553674
My point is that the entire check seems to be trying to validate that the tserver can read, and has a consistent view of, the metadata tablet. If this check cannot complete, then I think the intention was to "hard" fail the tserver so that it does not try to run with a possibly incomplete or incorrect view of the metadata.

Because it seems that a transient (but somehow repeating) failure is stopping the tserver, a more conservative change would be to retry the scan (or the entire check) to guard against transient failures. If the check still cannot complete within a reasonable period, it would be safer to continue to stop the process. Changing the check so that it only logs the exception could lead to cases where the tserver continues to run when the intention is that it would have been halted.

We may want to add resiliency to the check, but the underlying issue seems to be that something in the environment is repeatedly causing a scan failure while reading the metadata. Nerfing the check to accommodate that case may just be hiding an underlying issue that needs to be corrected, and this change could allow other issues to be hidden as well.

Again, I am addressing the overall goal of the check, not the issue that a transient scan failure seems to be triggering it more aggressively than necessary.
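
To make the "retry, but still halt if it never completes" suggestion concrete, here is a minimal sketch in plain Java. It is not taken from the PR or from the Accumulo code base: the class `BoundedRetryCheck`, the method `retryUntil`, and the deadline and pause parameters are all hypothetical names chosen for illustration, and the real check would need to plug in its own scan and halt logic.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Accumulo: retries a check for a bounded
// period and reports failure to the caller if it never completes.
final class BoundedRetryCheck {

  private BoundedRetryCheck() {}

  /**
   * Calls {@code check} until it returns without throwing or until
   * {@code deadline} has elapsed. The check is always attempted at least once.
   *
   * @throws IllegalStateException if no attempt completed before the deadline;
   *         the last failure is attached as the cause
   */
  static <T> T retryUntil(Callable<T> check, Duration deadline, Duration pause)
      throws Exception {
    final long end = System.nanoTime() + deadline.toNanos();
    Exception last = null;
    int attempts = 0;
    do {
      attempts++;
      try {
        return check.call();
      } catch (InterruptedException e) {
        // Do not treat interruption as a transient failure.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        // Assumed transient: remember it and try again after a pause.
        last = e;
      }
      Thread.sleep(pause.toMillis());
    } while (System.nanoTime() - end < 0); // overflow-safe deadline comparison
    throw new IllegalStateException(
        "check did not complete after " + attempts + " attempts", last);
  }
}
```

With something like this, the caller would wrap the metadata scan in `retryUntil(...)` and keep the existing halt behavior in the `IllegalStateException` handler, so a one-off scan failure is tolerated while a persistent one still stops the process instead of only being logged.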
