cshannon commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1530279093

   > Note that the validation was not throwing an exception and hence was not 
actually halting the server (see Tablet.compareTabletInfo). It was halting 
because the tserver had a failure when attempting to scan the metadata. Hence 
catching Exception is OK here (or moving it to the non-critical thread pool) 
UNLESS you want to have the validation halt the server. In any case a scan 
failure should NOT halt the server.
   
   Right I wouldn't think we should halt either on scan failure so that's what 
this fix does, is prevent that. In terms of making it a non critical task I 
guess that is fine too, if we are catching the exception and continuing I'm not 
sure it much matters at that point. Either way it seems like catching an 
exception and logging it there vs killing the task makes sense so it can be 
retried normally so the task and server don't fall over.
   
   In terms of whether not we retry faster or some other logic on scan failure 
maybe that's a follow on issue if that is going to be more complex and maybe 
could be something for 2.1.2. This at least fixes the immediately issue that is 
a regression in 2.1 for 2.1.1 so the server doesn't fall over on a scan failure 
which just may be transient. Halting on metadata failure itself (not a scan 
failure) is another issue entirely of course and maybe that is valid but maybe 
shouldn't be a bug fix release and 3.0 instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to