EdColeman commented on issue #1689: URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694639463
It has occurred more than once; however, I'm not sure how to approach recreating it in a test environment. The trigger seems to be that the dynamic classloader starts to thrash, continually reloading - I suspect this consumes memory or other resources until things fail. One thing that sets off the classloader is when new jars are deployed in HDFS for the system, but then it's only one or two servers that get into this state. On another occasion, it was a server restart. I'd like to treat the classloader as a separate issue. The classloader issue may be a trigger, but the fact that the tserver can get into this state and write corrupt files probably should be addressed even if that one trigger is eliminated - maybe there are others?

The stack trace points to the NullPointerException coming from the Tables.exists() call - I'll work on getting a more detailed representation of it. It's clear that the tserver is unhealthy, and I agree, I am skeptical of Tables.exists() returning a null, but if things are that bad, then I was trying to see if there was an approach that, philosophically, would pull the plug to eliminate or at least greatly reduce the blast footprint. The thing is that eventually the tserver does seem to die with a segfault - but only after a lengthy time (days), with the potential for corrupt files being written and causing data loss the whole time the tserver is in this state.

The bad files are only uncovered during the next compaction, so there is quite a delay between when the damage occurred and when Accumulo reports an issue anywhere other than the tserver logs. To find the problem, it is necessary to take the file reported as corrupt (it shows up in the UI error log) and then grep across all of the tserver log files looking for the compaction plan log message that created said file. Other than the corrupted files, the tserver seems to be operating normally - but that might not be a correct assumption - there are no other indications of errors, but that might be different than operating as intended. The delay in discovery does not help in answering this question.

As far as 2.x: 1) again, I'm not sure how to reliably trigger this; 2) with 2.x there could be an entirely different approach. There is an open issue - https://github.com/apache/accumulo/issues/946 - that could be implemented to achieve this. My thought was that if the tserver knew that it needed a memory manager thread and that thread died, then the tserver could take action - either re-spawn a new process or terminate itself. If the tserver were aware, then throwing an exception that kills the thread would be appropriate. As it is now, I think that if the memory manager thread dies the tserver would not know, and that seemed like a sub-optimal condition, so I was exploring whether it could be appropriate to kill the server directly - and removing the lock was the most direct way that I thought of (a rough sketch of the critical-thread idea follows at the end of this comment).

Implementing critical thread monitoring and recovery would be more comprehensive and probably a better approach - but the changes would be significant enough that I would not call them a bug fix. Since this is happening on 1.9.3 and was not addressed in 1.10, I think a bug fix is appropriate and necessary. Given that this has the potential for data loss, having a fix in 1.10.1 would likely be suitable for people to back-port and patch - and I suspect the desire for that is more immediate than could be achieved by starting with 2.x as the target.
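To illustrate the critical-thread idea discussed above, here is a minimal sketch using only plain JDK classes. It is not Accumulo's actual implementation; names like `startCritical` and the `memory-manager` thread are hypothetical. The point is simply that an uncaught-exception handler lets the process halt immediately when a thread it cannot operate without dies, rather than continuing in an unknown state:

```java
import java.util.concurrent.TimeUnit;

// Minimal, self-contained sketch (plain JDK, hypothetical names) of the
// "critical thread" idea: if a thread the server cannot run safely without
// dies, fail fast instead of limping along and risking corrupt writes.
public class CriticalThreadExample {

  /** Starts a task on a thread whose unexpected death halts the process. */
  static Thread startCritical(String name, Runnable task) {
    Thread t = new Thread(task, name);
    t.setDaemon(true);
    t.setUncaughtExceptionHandler((thread, err) -> {
      // Log enough context to diagnose, then halt rather than continue
      // in an unknown state (the "pull the plug" approach described above).
      System.err.println("Critical thread '" + thread.getName() + "' died: " + err);
      err.printStackTrace();
      Runtime.getRuntime().halt(1);
    });
    t.start();
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-in for the memory manager loop; a real implementation would
    // watch in-memory map sizes and schedule minor compactions.
    startCritical("memory-manager", () -> {
      while (true) {
        try {
          TimeUnit.SECONDS.sleep(1);
          // ... memory management work; an unexpected RuntimeException here
          // reaches the uncaught-exception handler and halts the JVM.
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("memory manager interrupted", e);
        }
      }
    });

    // The rest of the server keeps running; if the critical thread dies,
    // the whole process exits instead of silently writing bad files.
    TimeUnit.MINUTES.sleep(1);
  }
}
```

In 2.x this kind of monitoring could also re-spawn the thread or escalate to another component, but for a targeted 1.10.x bug fix, halting on the unexpected death of the thread is the smallest change.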
