EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694639463


   It has occurred more than once; however, I'm not sure how to approach 
recreating it in a test environment.  The trigger seems to be the dynamic 
classloader starting to thrash, continually reloading - I suspect this is 
consuming memory or other resources until things fail.  One thing that sets 
off the classloader is new jars being deployed in HDFS for the system, but 
then it's only one or two servers that get into this state.  On another 
occasion, it was a server restart.
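
   Purely to make the suspected failure mode concrete, here is a minimal 
sketch of what detecting that thrash could look like.  The notifyReload() 
hook and the per-minute threshold are illustrative assumptions on my part, 
not existing Accumulo APIs:

   ```java
   import java.util.concurrent.Executors;
   import java.util.concurrent.ScheduledExecutorService;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicLong;

   // Hypothetical watchdog: flags a dynamic classloader that is reloading
   // far more often than a one-time jar deployment should ever cause.
   public class ReloadThrashDetector {
     private final AtomicLong reloads = new AtomicLong();
     private final long maxReloadsPerMinute;

     public ReloadThrashDetector(long maxReloadsPerMinute) {
       this.maxReloadsPerMinute = maxReloadsPerMinute;
       ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
       timer.scheduleAtFixedRate(this::check, 1, 1, TimeUnit.MINUTES);
     }

     // The (hypothetical) classloader would call this on every reload.
     public void notifyReload() {
       reloads.incrementAndGet();
     }

     private void check() {
       long count = reloads.getAndSet(0);
       if (count > maxReloadsPerMinute) {
         // A real server would alert here, or trigger the "pull the plug"
         // handling discussed below.
         System.err.println("classloader reloaded " + count + " times in the last minute");
       }
     }
   }
   ```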
   
   I'd like to treat the classloader as a separate issue.  The classloader 
issue may be a trigger, but the fact that the tserver can get into this 
state and write corrupt files should probably be addressed; even if that 
one trigger is eliminated, there may be others.
   
   The stacktrace points to the NullPointerException coming from the 
Tables.exists() call - I'll work on getting a more detailed representation 
of it.
   
   It's clear that the tserver is unhealthy, and I agree - I am also 
skeptical of Tables.exists() returning a null - but if things are that bad, 
then I was trying to see if there was an approach that, philosophically, 
would pull the plug to eliminate or at least greatly reduce the blast 
radius.
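
   As a rough sketch of what pulling the plug could look like - this is my 
assumption, not existing tserver code, and it assumes the 1.9-era 
Tables.exists(Instance, String) signature:

   ```java
   import org.apache.accumulo.core.client.Instance;
   import org.apache.accumulo.core.client.impl.Tables;

   // Sketch only: if the metadata check blows up in a way that should be
   // impossible, stop the process before it can write any more files.
   public class FailFastTableCheck {
     public static boolean tableExistsOrDie(Instance instance, String tableId) {
       try {
         return Tables.exists(instance, tableId);
       } catch (RuntimeException e) {
         System.err.println("FATAL: Tables.exists failed unexpectedly, halting tserver");
         e.printStackTrace();
         Runtime.getRuntime().halt(1);
         return false; // not reached
       }
     }
   }
   ```

   Using Runtime.halt rather than System.exit is deliberate in this sketch: 
shutdown hooks on an unhealthy server could themselves try to write files.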
   
   The thing is that eventually the tserver does seem to die with a 
segfault - but only after a lengthy time (days), with the potential for 
corrupt files being written and causing data loss the whole time the 
tserver is in this state.  The bad files are only uncovered during the next 
compaction, so there is quite a delay between when the damage occurs and 
when Accumulo reports an issue anywhere other than the tserver logs.  To 
find the problem, it is necessary to take the file reported as corrupt (it 
shows up in the UI error log) and then grep across all of the tserver log 
files for the compaction plan log message that created that file.
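
   For what it's worth, that search can be scripted.  A rough sketch 
follows - the log directory, the tserver file naming, and the message 
matching are assumptions about a typical deployment, not exact Accumulo 
formats:

   ```java
   import java.io.IOException;
   import java.nio.charset.StandardCharsets;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.Paths;
   import java.util.stream.Stream;

   // Rough sketch: scan every tserver log for mentions of the corrupt file
   // so the compaction plan message that created it can be located.
   public class FindCompactionPlan {
     public static void main(String[] args) throws IOException {
       String corruptFile = args[0]; // e.g. the file name from the UI error log
       Path logDir = Paths.get(args.length > 1 ? args[1] : "/var/log/accumulo");
       try (Stream<Path> logs = Files.walk(logDir)) {
         logs.filter(Files::isRegularFile)
             .filter(p -> p.getFileName().toString().contains("tserver"))
             .forEach(p -> grep(p, corruptFile));
       }
     }

     private static void grep(Path log, String needle) {
       // ISO-8859-1 avoids charset errors on arbitrary log bytes
       try (Stream<String> lines = Files.lines(log, StandardCharsets.ISO_8859_1)) {
         lines.filter(l -> l.contains(needle))
              .forEach(l -> System.out.println(log + ": " + l));
       } catch (IOException e) {
         System.err.println("skipping " + log + ": " + e.getMessage());
       }
     }
   }
   ```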
   
   Other than the corrupted files, the tserver seems to be operating 
normally - but that might not be a correct assumption.  There are no other 
indications of errors, but the absence of errors is not the same as 
operating as intended, and the delay in discovery does not help in 
answering this question.
   
   As far as 2.x goes: 1) again, I'm not sure how to reliably trigger 
this, and 2) with 2.x there could be an entirely different approach.
   
   There is an open issue - https://github.com/apache/accumulo/issues/946 
- that could be implemented to achieve this.  My thought was that if the 
tserver knew that it needed a memory manager thread and that thread died, 
then the tserver could take action - either re-spawn the thread or 
terminate itself.  If the tserver were aware, then throwing an exception 
that kills the thread would be appropriate.  As it is now, I think that if 
the memory manager thread dies, the tserver would not know, and that 
seemed like a sub-optimal condition, so I was exploring whether it could 
be appropriate to kill the server directly - and removing the lock was the 
most direct way I thought of.
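
   To make that concrete, here is a minimal sketch of the idea - 
CriticalThread, the restart limit, and the usage line are all hypothetical, 
not the #946 design:

   ```java
   import java.util.concurrent.atomic.AtomicInteger;

   // Minimal sketch of critical-thread awareness: if a thread the tserver
   // cannot run without dies, re-spawn it a few times, then halt the JVM
   // rather than continuing silently without it.
   public class CriticalThread {
     private static final int MAX_RESTARTS = 3;

     public static void start(String name, Runnable task) {
       start(name, task, new AtomicInteger());
     }

     private static void start(String name, Runnable task, AtomicInteger restarts) {
       Thread t = new Thread(task, name);
       t.setUncaughtExceptionHandler((thread, err) -> {
         if (restarts.incrementAndGet() <= MAX_RESTARTS) {
           System.err.println(name + " died, re-spawning: " + err);
           start(name, task, restarts); // option 1: re-spawn the thread
         } else {
           System.err.println(name + " died repeatedly, halting tserver");
           Runtime.getRuntime().halt(1); // option 2: pull the plug
         }
       });
       t.start();
     }
   }

   // Usage (memoryManager is hypothetical):
   // CriticalThread.start("memory-manager", () -> memoryManager.run());
   ```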
   
   Implementing critical thread monitoring and recovery would be more 
comprehensive and probably a better approach - but the changes would be 
significant enough that I would not call them a bug fix.  Since this is 
happening on 1.9.3 and was not addressed in 1.10, I think a bug fix is 
appropriate and necessary.  Given that this has the potential for data 
loss, having a fix in 1.10.1 would likely be suitable for people to 
back-port and patch - and I suspect the desire for that is more immediate 
than could be achieved by starting with 2.x as the target.
   
   
   

