[jira] [Commented] (HBASE-12028) Abort the RegionServer, when one of it's handler threads die

Andrew Purtell (JIRA) Sat, 20 Sep 2014 10:25:51 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142123#comment-14142123
 ]


Andrew Purtell commented on HBASE-12028:
----------------------------------------

Aborting the RS if a handler thread terminates due to an unhandled throwable 
seems like a good idea. Who knows what state the thread left things in, what 
operation is unclosed, what lock is still held, what counter was not 
decremented. The RS could be a ticking bomb. 

> Abort the RegionServer, when one of it's handler threads die
> ------------------------------------------------------------
>
>                 Key: HBASE-12028
>                 URL: https://issues.apache.org/jira/browse/HBASE-12028
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: Sudarshan Kadambi
>
> Over in HBase-11813, a user identified an issue where in all the RPC handler 
> threads would exit with StackOverflow errors due to an unchecked 
> recursion-terminating condition. Our clusters demonstrated the same trace. 
> While the patch posted for HBASE-11813 got our clusters to be merry again, 
> the breakdown surfaced some larger issues.
> When the RegionServer had all it's RPC handler threads dead, it continued to 
> have regions assigned it. Clearly, it wouldn't be able to serve reads and 
> writes on those regions. A second issue was that when a user tried to disable 
> or drop a table, the master would try to communicate to the regionserver for 
> region unassignment. Since the same handler threads seem to be used for 
> master <-> RS communication as well, the master ended up hanging on the RS 
> indefinitely. Eventually, the master stopped responding to all table 
> meta-operations.
> A handler thread should never exit, and if it does, it seems like the more 
> prudent thing to do would be for the RS to abort. This way, atleast recovery 
> can be undertaken and the regions could be reassigned elsewhere. I also think 
> that the master<->RS communication should get its own exclusive threadpool, 
> but I'll wait until this issue has been sufficiently discussed before opening 
> an issue ticket for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HBASE-12028) Abort the RegionServer, when one of it's handler threads die

Reply via email to