Sudarshan Kadambi created HBASE-12028:
-----------------------------------------

             Summary: Abort the RegionServer when one of its handler threads dies
                 Key: HBASE-12028
                 URL: https://issues.apache.org/jira/browse/HBASE-12028
             Project: HBase
          Issue Type: Bug
          Components: regionserver
            Reporter: Sudarshan Kadambi


Over in HBASE-11813, a user identified an issue in which all of the RPC handler
threads would exit with a StackOverflowError due to an unchecked
recursion-terminating condition. Our clusters showed the same trace.
While the patch posted for HBASE-11813 got our clusters to be merry again, the
breakdown surfaced some larger issues.

When the RegionServer had all of its RPC handler threads dead, it continued to
have regions assigned to it. Clearly, it wouldn't be able to serve reads and
writes on those regions. A second issue was that when a user tried to disable
or drop a table, the master would try to contact the RegionServer for
region unassignment. Since the same handler threads seem to be used for master
<-> RS communication as well, the master ended up hanging on the RS
indefinitely. Eventually, the master stopped responding to all table
meta-operations.

A handler thread should never exit, and if it does, the more prudent thing
to do would be for the RS to abort. This way, at least recovery
can be undertaken and the regions can be reassigned elsewhere. I also think
that master <-> RS communication should get its own exclusive thread pool, but
I'll wait until this issue has been sufficiently discussed before opening a
separate ticket for that.
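
To make the proposal concrete, here is a minimal sketch of the abort-on-death
idea in plain Java. It is not taken from the HBase code base: the Abortable
interface, startHandler, and the class name are hypothetical stand-ins, and the
runaway recursion only imitates the failure seen in HBASE-11813. The only
assumption about the JDK is that Thread.setUncaughtExceptionHandler is called
for any Throwable that escapes run(), which includes StackOverflowError.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class HandlerAbortSketch {

    /** Hypothetical stand-in for the server's abort hook; not an HBase interface. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /** Starts a daemon handler thread that asks the server to abort if its body throws. */
    static Thread startHandler(String name, Runnable body, Abortable server) {
        Thread t = new Thread(body, name);
        t.setDaemon(true);
        // Invoked by the JVM when an exception or error escapes the thread's run().
        t.setUncaughtExceptionHandler((thread, err) ->
            server.abort("handler thread " + thread.getName() + " died", err));
        t.start();
        return t;
    }

    /** Unbounded recursion, imitating the missing termination check. */
    static void recurse() {
        recurse();
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch aborted = new CountDownLatch(1);
        AtomicReference<String> reason = new AtomicReference<>();

        // A real server would shut down here; the sketch only records the request.
        Abortable server = (why, cause) -> {
            reason.set(why + ": " + cause);
            aborted.countDown();
        };

        startHandler("rpc-handler-0", HandlerAbortSketch::recurse, server);

        boolean sawAbort = aborted.await(5, TimeUnit.SECONDS);
        System.out.println(sawAbort ? "aborting, " + reason.get() : "no abort observed");
    }
}
```

Note that the uncaught-exception hook only fires when a throwable escapes the
thread. A handler that returns from run() normally would go unnoticed, so a
real fix would probably also need to treat a clean exit outside of shutdown as
fatal.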



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
