[jira] [Commented] (ACCUMULO-1708) Error during minor compaction left tserver in bad state

Keith Turner (JIRA) Thu, 12 Sep 2013 23:32:46 -0700

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765821#comment-13765821
 ]


Keith Turner commented on ACCUMULO-1708:
----------------------------------------

I investigated using thread groups to handle this problem in a general way.
It's possible to create a thread group that handles any uncaught exception.
By default, when a thread creates a new thread, the new thread inherits its
creator's thread group.  Therefore threads created in zookeeper and hdfs code
would inherit the calling thread's thread group.  This seemed like a promising
approach; however, the zookeeper and hdfs code catches Throwable and logs it,
so an Error never leaves the thread as an uncaught exception.  If this code
caught RuntimeException instead, the thread group approach would work nicely
because Errors would percolate up to the handler in the ThreadGroup.

DFSClient.DFSOutputStream.DataStreamer
org.apache.zookeeper.ClientCnxn$EventThread.run()
org.apache.zookeeper.ClientCnxn$SendThread.run()
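
To make the thread group behavior concrete, here is a minimal sketch, not
Accumulo code.  The names ThreadGroupSketch, RecordingGroup, runInChild and
fail are made up for illustration, and the OutOfMemoryError is simulated.  It
shows a child thread inheriting its creator's group, and how a catch of
Throwable inside the child hides the Error from the group's handler while a
catch of RuntimeException does not.

```java
import java.util.concurrent.atomic.AtomicReference;

final class ThreadGroupSketch {

    /** Hypothetical group that records the uncaught Throwable of any member thread. */
    static final class RecordingGroup extends ThreadGroup {
        final AtomicReference<Throwable> seen = new AtomicReference<>();

        RecordingGroup(String name) {
            super(name);
        }

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            // A real handler might log and halt the process; this one only records.
            seen.set(e);
        }
    }

    /**
     * Starts a parent thread in a fresh group. The parent creates a child thread
     * without naming a group, so the child inherits the parent's group.
     * Returns whatever the group's handler saw, or null if nothing escaped.
     */
    static Throwable runInChild(Runnable task) {
        RecordingGroup group = new RecordingGroup("sketch");
        Thread parent = new Thread(group, () -> {
            Thread child = new Thread(task); // inherits "sketch" group
            child.start();
            try {
                child.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parent.start();
        try {
            parent.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return group.seen.get();
    }

    /** Simulates the failure; no memory is actually exhausted. */
    static void fail() {
        throw new OutOfMemoryError("simulated");
    }

    public static void main(String[] args) {
        // Error escapes run(): the group's handler is invoked.
        Throwable escaped = runInChild(ThreadGroupSketch::fail);

        // Library-style catch of Throwable: the handler never sees the Error.
        Throwable swallowed = runInChild(() -> {
            try { fail(); } catch (Throwable t) { /* logged and dropped */ }
        });

        // Catching only RuntimeException lets the Error through to the handler.
        Throwable narrow = runInChild(() -> {
            try { fail(); } catch (RuntimeException e) { /* not reached */ }
        });

        System.out.println("escaped:   " + escaped);
        System.out.println("swallowed: " + swallowed);
        System.out.println("narrow:    " + narrow);
    }
}
```

The second case is the situation described above: once the library's own run()
method catches Throwable, nothing is left for the ThreadGroup to handle.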

We can make all watchers we pass to zookeeper handle Errors, but it is still
possible that zookeeper code executing in a background thread could encounter
an OOME.  I think I have seen a zookeeper thread die from an OOME before, and
as a result a tserver that lost its lock did not die.  For the threads created
by DFSClient, I do not think we have any direct control.
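
One way the watcher idea could look is sketched below.  This is an assumption
about the shape of such a wrapper, not existing Accumulo or ZooKeeper code:
Callback is a hypothetical stand-in for a watcher interface, and escalating
and libraryDispatch are made-up names.  The wrapper hands any Error to a fatal
handler before the caller's catch of Throwable can swallow it.  In a real
process the handler might call Runtime.getRuntime().halt(...).

```java
import java.util.function.Consumer;

final class ErrorEscalationSketch {

    /** Hypothetical stand-in for a callback interface such as a watcher. */
    interface Callback {
        void process(String event);
    }

    /** Wraps a callback so that any Error reaches fatalHandler before it propagates. */
    static Callback escalating(Callback delegate, Consumer<Error> fatalHandler) {
        return event -> {
            try {
                delegate.process(event);
            } catch (Error e) {
                fatalHandler.accept(e); // act before the caller can swallow it
                throw e;
            }
        };
    }

    /** Mimics library code that invokes a callback and swallows any Throwable. */
    static void libraryDispatch(Callback cb, String event) {
        try {
            cb.process(event);
        } catch (Throwable t) {
            // the library logs and continues
        }
    }

    public static void main(String[] args) {
        Callback failing = event -> {
            throw new OutOfMemoryError("simulated");
        };
        Callback wrapped = escalating(failing, e -> System.out.println("fatal: " + e));
        libraryDispatch(wrapped, "node-changed");
    }
}
```

This only covers Errors thrown inside our own callbacks.  It does nothing for
an OOME raised in the library's own background-thread code, which is the
remaining gap described above.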
                
> Error during minor compaction left tserver in bad state
> -------------------------------------------------------
>
>                 Key: ACCUMULO-1708
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1708
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Keith Turner
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> A tserver experienced an OOME during minor compaction.  This OOME was thrown 
> because Java could not create a native thread.  Minor compactions only catch 
> declared exceptions and RuntimeExceptions.  This left the system in a state 
> where the compaction was not running but the tserver thought it was.  This 
> caused "flush -w" to hang and prevented the tserver from reclaiming memory.
> For whatever reason, the OOME handler that kills the process did not kick in 
> (it seems to kick in only for OOMEs related to heap allocation).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
