EdColeman commented on issue #1689: URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696434005
Overall, there may be a difference in philosophy or focus. I do not disagree with the premises of anything that you wrote, but I think I'm coming at it from a different angle. I'm aware of the classloader rework. That's one reason I was less focused on "how it got that way" and am trying to address "no matter how we got here, can we at least stop writing corrupt files?" Whatever the root cause, we should try to protect ourselves, and if we can't recover, then at least stop corrupting data.

As this has unfolded, it seems very likely that things are pointing to something external to Accumulo. But I think it's a bug that Accumulo keeps working (and working incorrectly). It should be a given that the hardware works, but it is impossible to provide that guarantee. Things go wrong.

I agree that catching Throwable may not always be appropriate, and in this case it is not. So, for this one case, is there an acceptable solution? I've proposed one way. A second, more general way could be to leverage `Thread.UncaughtExceptionHandler`. Using that, the tserver could create the threads and assign a handler that would do essentially the same thing: stop the tserver, either by deleting the lock or by whatever the preferred mechanism is, if the underlying "critical" thread dies. Then we don't need to guard against unexpected exceptions. We let them kill the thread, and then decide either to kill the tserver or maybe to spawn a new thread, if that could be determined to be appropriate and safe.

1) So, in general, for cases where we are catching `Throwable` and that is causing issues, would it be better if we stopped the tserver?
2) If it is determined that we want to stop, is deleting the lock acceptable, or is there a preferred alternate method?
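
To make the `Thread.UncaughtExceptionHandler` idea concrete, here is a minimal sketch of what I have in mind. It is not taken from the Accumulo code base: `CriticalThreadSketch`, `ShutdownAction`, and `newCriticalThread` are hypothetical names, and the shutdown action is only a placeholder for deleting the lock or whatever mechanism is preferred.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CriticalThreadSketch {

  /** Hypothetical stand-in for "stop the tserver", e.g. by deleting the lock. */
  interface ShutdownAction {
    void shutdown(String threadName, Throwable cause);
  }

  /**
   * Creates a thread for a "critical" task. The task does not catch Throwable;
   * anything unexpected kills the thread and is passed to the handler, which
   * then decides what to do (here: always request a shutdown).
   */
  static Thread newCriticalThread(String name, Runnable task, ShutdownAction onDeath) {
    Thread t = new Thread(task, name);
    t.setUncaughtExceptionHandler(
        (thread, cause) -> onDeath.shutdown(thread.getName(), cause));
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean stopRequested = new AtomicBoolean(false);

    Thread t = newCriticalThread(
        "critical-writer", // hypothetical thread name
        () -> {
          // Simulated unexpected failure inside the critical task.
          throw new IllegalStateException("simulated failure");
        },
        (name, cause) -> {
          // Placeholder: a real handler would stop the server here.
          System.err.println("Critical thread " + name + " died: " + cause);
          stopRequested.set(true);
        });

    t.start();
    t.join(); // the handler runs on the dying thread, before join() returns
    System.out.println("shutdown requested: " + stopRequested.get());
  }
}
```

The handler is the single place where the "kill the tserver or spawn a new thread" decision would live, so the task code itself would not need a catch-all block.
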
While catching and swallowing Throwable in general is a bigger issue, for this one case, where we have identified that it is not appropriate, can we fix that now and then examine the larger issue as time allows or when other cases are identified?
