cshannon commented on PR #3366: URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1530279093
> Note that the validation was not throwing an exception and hence was not actually halting the server (see Tablet.compareTabletInfo). It was halting because the tserver had a failure when attempting to scan the metadata. Hence catching Exception is OK here (or moving it to the non-critical thread pool) UNLESS you want to have the validation halt the server. In any case a scan failure should NOT halt the server. Right I wouldn't think we should halt either on scan failure so that's what this fix does, is prevent that. In terms of making it a non critical task I guess that is fine too, if we are catching the exception and continuing I'm not sure it much matters at that point. Either way it seems like catching an exception and logging it there vs killing the task makes sense so it can be retried normally so the task and server don't fall over. In terms of whether not we retry faster or some other logic on scan failure maybe that's a follow on issue if that is going to be more complex and maybe could be something for 2.1.2. This at least fixes the immediately issue that is a regression in 2.1 for 2.1.1 so the server doesn't fall over on a scan failure which just may be transient. Halting on metadata failure itself (not a scan failure) is another issue entirely of course and maybe that is valid but maybe shouldn't be a bug fix release and 3.0 instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
