[jira] [Commented] (HBASE-12028) Abort the RegionServer, when one of it's handler threads die

Hadoop QA (JIRA) Sun, 07 Dec 2014 17:50:30 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237353#comment-14237353
 ]


Hadoop QA commented on HBASE-12028:
-----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12685642/Hbase-12028.patch
  against master branch at commit 9fd6db3703d3e7ec50b32b1e96c65ed9f2b1456d.
  ATTACHMENT ID: 12685642

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified tests.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

    {color:red}-1 checkstyle{color}.  The applied patch generated 
2103 checkstyle errors (more than the master's current 2089 errors).

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+  public static final String REGION_SERVER_HANDLERFAILURE_PERCENT = "hbase.regionserver.handlerfailure.percent";
+      final float readShare, final int maxQueueLength, final Configuration conf, final Abortable abortable) {
+    this(name, handlerCount, numQueues, readShare, maxQueueLength, 0, conf, abortable, LinkedBlockingQueue.class);
+      final float readShare, final float scanShare, final int maxQueueLength, final Configuration conf, final Abortable abortable) {
+      final float readShare, final int maxQueueLength, final Configuration conf, final Abortable abortable,
+      final float readShare, final float scanShare, final int maxQueueLength, final Configuration conf, final Abortable abortable, 
+              String message = "RpcServer handler thread encountered an exception, close client connection";
+        callExecutor = new RWQueueRpcExecutor("RW.default", handlerCount, numCallQueues, callqReadShare, callqScanShare, maxQueueLength, conf, abortable);
+        callExecutor = new BalancedQueueRpcExecutor("B.default", handlerCount, numCallQueues, conf, abortable,
+         this(conf, handlerCount, priorityHandlerCount, replicationHandlerCount, priority, null, highPriorityLevel);
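
The line-length finding above can be approximated locally before uploading a patch. The sketch below is not the Hadoop QA script; the class and method names are hypothetical, and it assumes a unified diff in which added lines start with a single "+" and the limit of 100 applies to the text after that marker.

```java
import java.util.ArrayList;
import java.util.List;

class LongLineCheck {
  // Limit taken from the report text ("lines longer than 100").
  static final int MAX_LINE_LENGTH = 100;

  /**
   * Returns the added lines of a unified diff whose content exceeds the limit.
   * Simplification: any line starting with "+++" is treated as a file header,
   * so an added line whose own text begins with "++" would be skipped.
   */
  static List<String> longAddedLines(List<String> patchLines) {
    List<String> offenders = new ArrayList<>();
    for (String line : patchLines) {
      if (line.startsWith("+") && !line.startsWith("+++")) {
        // Drop the leading "+" diff marker before measuring.
        String content = line.substring(1);
        if (content.length() > MAX_LINE_LENGTH) {
          offenders.add(line);
        }
      }
    }
    return offenders;
  }
}
```

Feeding the patch file's lines to longAddedLines and wrapping whatever it returns (for example, breaking the constructor parameter lists after a comma) should address this kind of finding.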

    {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

    {color:red}-1 core tests{color}.  The patch failed these unit tests:
                  org.apache.hadoop.hbase.master.TestMaster
                  org.apache.hadoop.hbase.master.TestTableLockManager
                  org.apache.hadoop.hbase.master.TestMasterShutdown
                  org.apache.hadoop.hbase.client.TestReplicaWithCluster
                  org.apache.hadoop.hbase.TestRegionRebalancing
                  org.apache.hadoop.hbase.client.TestHCM
                  org.apache.hadoop.hbase.master.TestRestartCluster
                  org.apache.hadoop.hbase.master.handler.TestCreateTableHandler
                  org.apache.hadoop.hbase.master.TestMasterFailover
                  org.apache.hadoop.hbase.client.TestScannersFromClientSide
                  org.apache.hadoop.hbase.master.TestDistributedLogSplitting

    {color:red}-1 core zombie tests{color}.  There are 16 zombie test(s):
        at org.apache.hadoop.hbase.client.TestMetaScanner.testMetaScanner(TestMetaScanner.java:74)
        at org.apache.hadoop.hbase.mapreduce.TestTableSnapshotInputFormat.testInitTableSnapshotMapperJobConfig(TestTableSnapshotInputFormat.java:157)
        at org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat.testMRIncrementalLoad(TestHFileOutputFormat.java:360)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testAssignDisabledRegion(TestAssignmentManagerOnCluster.java:931)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testMoveRegion(TestAssignmentManagerOnCluster.java:353)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testReportRegionStateTransition(TestAssignmentManagerOnCluster.java:1121)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testAssignOfflinedRegionBySSH(TestAssignmentManagerOnCluster.java:946)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testAssignWhileClosing(TestAssignmentManagerOnCluster.java:437)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testOfflineRegion(TestAssignmentManagerOnCluster.java:303)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testSSHWaitForServerToAssignRegion(TestAssignmentManagerOnCluster.java:853)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testSSHWhenDisablingTableRegionsInOpeningOrPendingOpenState(TestAssignmentManagerOnCluster.java:622)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testAssignRacingWithSSH(TestAssignmentManagerOnCluster.java:777)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testCloseHang(TestAssignmentManagerOnCluster.java:659)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testMoveRegionOfDeletedTable(TestAssignmentManagerOnCluster.java:368)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testAssignDisabledRegionBySSH(TestAssignmentManagerOnCluster.java:1086)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testOpenFailed(TestAssignmentManagerOnCluster.java:560)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testOpenFailedUnrecoverable(TestAssignmentManagerOnCluster.java:609)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testCloseFailed(TestAssignmentManagerOnCluster.java:519)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testAssignDisabledRegion(TestAssignmentManagerOnCluster.java:931)
        at org.apache.hadoop.hbase.util.TestHBaseFsck.testFixHdfsHolesNotWorkingWithNoHdfsChecking(TestHBaseFsck.java:1903)
        at org.apache.hadoop.hbase.client.TestTableSnapshotScanner.testScanner(TestTableSnapshotScanner.java:133)
        at org.apache.hadoop.hbase.client.TestTableSnapshotScanner.testWithOfflineHBaseMultiRegion(TestTableSnapshotScanner.java:125)
        at org.apache.hadoop.hbase.master.TestMasterOperationsForRegionReplicas.testCreateTableWithMultipleReplicas(TestMasterOperationsForRegionReplicas.java:230)
        at org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat2.testMRIncrementalLoad(TestHFileOutputFormat2.java:359)
        at org.apache.hadoop.hbase.client.TestFromClientSide.testIllegalTableDescriptor(TestFromClientSide.java:5428)
        at org.apache.hadoop.hbase.master.TestHMasterRPCException.testRPCException(TestHMasterRPCException.java:67)
        at org.apache.hadoop.hbase.TestAcidGuarantees.testGetAtomicity(TestAcidGuarantees.java:334)
        at org.apache.hadoop.hbase.client.TestFromClientSide.testIllegalTableDescriptor(TestFromClientSide.java:5428)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11989//console

This message is automatically generated.

> Abort the RegionServer, when one of it's handler threads die
> ------------------------------------------------------------
>
>                 Key: HBASE-12028
>                 URL: https://issues.apache.org/jira/browse/HBASE-12028
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: Sudarshan Kadambi
>            Assignee: Alicia Ying Shu
>         Attachments: Hbase-12028.patch
>
>
> Over in HBASE-11813, a user identified an issue wherein all the RPC handler 
> threads would exit with StackOverflow errors due to an unchecked 
> recursion-terminating condition. Our clusters showed the same trace. 
> While the patch posted for HBASE-11813 got our clusters to be merry again, 
> the breakdown surfaced some larger issues.
> When all of the RegionServer's RPC handler threads were dead, it continued to 
> have regions assigned to it. Clearly, it could not serve reads and 
> writes on those regions. A second issue was that when a user tried to disable 
> or drop a table, the master would try to communicate with the regionserver for 
> region unassignment. Since the same handler threads seem to be used for 
> master <-> RS communication as well, the master ended up hanging on the RS 
> indefinitely. Eventually, the master stopped responding to all table 
> meta-operations.
> A handler thread should never exit, and if it does, the more prudent 
> course seems to be for the RS to abort. This way, at least recovery 
> can be undertaken and the regions can be reassigned elsewhere. I also think 
> that the master <-> RS communication should get its own exclusive threadpool, 
> but I'll wait until this issue has been sufficiently discussed before opening 
> a ticket for that.
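
The behavior requested above (abort the RS once handler threads die) can be sketched as follows. This is a minimal illustration, not the code in Hbase-12028.patch: the class name, the nested Abortable interface (a hypothetical stand-in for HBase's own), and the rule "abort once the number of dead handlers exceeds handlerCount * threshold" are all assumptions. The threshold plays the role that a setting such as hbase.regionserver.handlerfailure.percent might play, but how the patch interprets that key is not shown in this message.

```java
import java.util.concurrent.atomic.AtomicInteger;

class HandlerFailureSketch {
  /** Hypothetical stand-in for HBase's Abortable; only the method used here. */
  interface Abortable {
    void abort(String why, Throwable cause);
  }

  private final int handlerCount;
  private final float failureThreshold; // assumed to be a fraction in [0, 1]
  private final Abortable abortable;
  private final AtomicInteger failedHandlers = new AtomicInteger();

  HandlerFailureSketch(int handlerCount, float failureThreshold, Abortable abortable) {
    this.handlerCount = handlerCount;
    this.failureThreshold = failureThreshold;
    this.abortable = abortable;
  }

  /** Starts a handler thread whose death (any Throwable) is reported. */
  Thread startHandler(String name, Runnable loop) {
    Thread t = new Thread(() -> {
      try {
        loop.run();
      } catch (Throwable e) { // includes Errors such as StackOverflowError
        onHandlerDeath(name, e);
      }
    }, name);
    t.setDaemon(true);
    t.start();
    return t;
  }

  /** Counts the dead handler and aborts once the assumed threshold is exceeded. */
  void onHandlerDeath(String name, Throwable cause) {
    int failed = failedHandlers.incrementAndGet();
    if (failed > handlerCount * failureThreshold) {
      abortable.abort("handler " + name + " died (" + failed + "/" + handlerCount
          + " failed)", cause);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Threshold 0 means the first dead handler triggers an abort.
    HandlerFailureSketch sketch = new HandlerFailureSketch(2, 0.0f,
        (why, cause) -> System.out.println("ABORT: " + why));
    Thread t = sketch.startHandler("handler-0",
        () -> { throw new StackOverflowError("simulated"); });
    t.join();
  }
}
```

Catching Throwable rather than Exception matters here, because the failures described in the issue were StackOverflow errors, which a catch of Exception would miss. In this sketch every failure past the threshold calls abort again, so a real abort implementation would need to tolerate repeated calls.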



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
