EdColeman commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1531553674

   My point is that the entire check seems to be trying to validate that the 
tserver can read, and has a consistent view of, the metadata tablet.  If this 
check cannot complete, then I think the intention was to "hard" fail the tserver 
so that it does not try to run with a possibly incomplete / incorrect view of 
the metadata.
   
   Because it seems a transient (but somehow repeating) failure is stopping the 
tserver, a more conservative change would be to retry the scan (or the entire 
check) to guard against transient failures. If the check still cannot complete 
within a reasonable period, it would be safer to continue to stop the process.
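
A minimal sketch of that retry-then-halt idea, to make the suggestion concrete. This is not Accumulo code: `BoundedRetry`, `retryUntil`, `maxWait` and `pause` are hypothetical names, and the real check would be passed in as the `Callable`. The check is retried while time remains, and once the deadline is reached the last failure is rethrown so the caller can still halt the process.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Accumulo: retries a check until it
// succeeds or a deadline passes, then rethrows the last failure.
final class BoundedRetry {

  private BoundedRetry() {}

  static <T> T retryUntil(Callable<T> check, Duration maxWait, Duration pause)
      throws Exception {
    final long deadline = System.nanoTime() + maxWait.toNanos();
    while (true) {
      try {
        // Success ends the loop immediately.
        return check.call();
      } catch (Exception e) {
        // Not enough time left for another pause and attempt: give up and
        // let the caller decide what to do (for example, halt the process).
        if (deadline - System.nanoTime() < pause.toNanos()) {
          throw e;
        }
        Thread.sleep(pause.toMillis());
      }
    }
  }
}
```

With something like this, the caller would keep the existing "hard" fail behavior in its catch block, so only failures that persist past `maxWait` stop the tserver.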
   
   Changing the check so that it only logs the exception could lead to cases 
where the tserver continues to run when the intention is that it should have 
been halted.
   
   We may want to add resiliency to the check, but the underlying issue seems 
to be that something in the environment is repeatedly causing a scan failure 
while reading the metadata. Nerfing the check to accommodate that case may just 
be hiding an underlying issue that needs to be corrected, and this change could 
allow other issues to be hidden as well.
   
   Again, I am addressing the overall goal of the check - not the issue that a 
transient scan failure seems to be triggering it more aggressively than necessary.


