[jira] [Commented] (HBASE-13330) Region left unassigned due to AM & SSH each thinking the assignment would be done by the other

Hadoop QA (JIRA) Mon, 06 Jul 2015 19:02:24 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616059#comment-14616059
 ]


Hadoop QA commented on HBASE-13330:
-----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12724681/13330-branch-1.txt
  against branch-1 branch at commit c220635c7893c96db675cb2b80af6ade4a44e3d4.
  ATTACHMENT ID: 12724681

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
                        Please justify why no new tests are needed for this 
patch.
                        Also please list what manual steps were performed to 
verify this patch.

    {color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

    {color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

    {color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:
     

     {color:red}-1 core zombie tests{color}.  There are 7 zombie test(s):
        at org.apache.hadoop.hbase.replication.TestReplicationSmallTests.testVerifyRepJob(TestReplicationSmallTests.java:481)
        at org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster.testOfflineRegion(TestAssignmentManagerOnCluster.java:377)
        at org.apache.hadoop.hbase.mapreduce.TestImportExport.testWithFilter(TestImportExport.java:459)
        at org.apache.hadoop.hbase.replication.TestPerTableCFReplication.testPerTableCFReplication(TestPerTableCFReplication.java:284)
        at org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancer.testWithCluster(TestStochasticLoadBalancer.java:656)
        at org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancer.testWithCluster(TestStochasticLoadBalancer.java:645)
        at org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancer.testMidCluster2(TestStochasticLoadBalancer.java:475)
        at org.apache.hadoop.hbase.replication.TestReplicationSyncUpTool.testSyncUpTool(TestReplicationSyncUpTool.java:173)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14683//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14683//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14683//artifact/patchprocess/checkstyle-aggregate.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14683//console

This message is automatically generated.

> Region left unassigned due to AM & SSH each thinking the assignment would be 
> done by the other
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13330
>                 URL: https://issues.apache.org/jira/browse/HBASE-13330
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 1.0.0
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1, 1.0.3
>
>         Attachments: 13330-branch-1.txt
>
>
> Here is what I found during analysis of an issue. Raising this jira and a fix 
> will follow.
> The TL;DR of this is that the AssignmentManager thinks the 
> ServerShutdownHandler would assign the region and the ServerShutdownHandler 
> thinks that the AssignmentManager would assign the region. The region 
> (0d6cf37c18c54c6f4744750c6a7be837) ultimately never gets assigned. Below is 
> an analysis from the logs that captures the flow of events.
> 1. The AssignmentManager had initially assigned this region to 
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> 2. When the master restarted, it scanned meta to learn about the regions in 
> the cluster. The meta record showed this region as assigned to 
> dnj1-bcpc-r3n8.example.com,60020,1425598187703.
> 3. However, this server (dnj1-bcpc-r3n8.example.com,60020,1425598187703) was 
> not alive anymore. So, the AssignmentManager queued up a 
> ServerShutdownHandling task for this (that asynchronously executes):
> {noformat}
> 2015-03-06 14:09:31,355 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> Added=dnj1-bcpc-r3n8.example.com,60020,1425598187703 to dead servers,
>  submitted shutdown handler to be executed meta=false
> {noformat}
> 4. The AssignmentManager proceeded to read the RIT nodes from ZK. It found 
> this region as well:
> {noformat}
> 2015-03-06 14:09:31,527 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Processing 
> 0d6cf37c18c54c6f4744750c6a7be837
> in state: RS_ZK_REGION_FAILED_OPEN
> {noformat}
> 5. The region was moved to CLOSED state:
> {noformat}
> 2015-03-06 14:09:31,527 WARN org.apache.hadoop.hbase.master.RegionStates: 
> 0d6cf37c18c54c6f4744750c6a7be837 moved to CLOSED on
> dnj1-bcpc-r3n2.example.com,60020,1425603618259, expected 
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> Note the reference to dnj1-bcpc-r3n2.example.com,60020,1425603618259. This 
> means that the region was assigned to 
> dnj1-bcpc-r3n2.example.com,60020,1425603618259 but that regionserver couldn't 
> open the region for some reason, and it changed the state to 
> RS_ZK_REGION_FAILED_OPEN in RIT znode on ZK.
> 6. After that the AssignmentManager tried to assign it again. However, the 
> assignment didn't happen because the ServerShutdownHandling task queued 
> earlier had not yet executed:
> {noformat}
> 2015-03-06 14:09:31,527 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning 
> phMonthlyVersion,\x89\x80\x00\x00,1423149098980.0d6cf37c18c54c6f4744750c6a7be837.,
>  it's host dnj1-bcpc-r3n8.example.com,60020,1425598187703 is dead but not 
> processed yet
> {noformat}
> 7. Eventually the ServerShutdownHandling task executed.
> {noformat}
> 2015-03-06 14:09:35,188 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs 
> for dnj1-bcpc-r3n8.example.com,60020,1425598187703 before assignment.
> 2015-03-06 14:09:35,209 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 19 
> region(s) that dnj1-bcpc-r3n8.example.com,60020,1425598187703 was
>  carrying (and 0 regions(s) that were opening on this server)
> 2015-03-06 14:09:35,211 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
> processing of shutdown of dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> 8. However, the ServerShutdownHandling task skipped the region in question. 
> This was because the region was in RIT on another server, and the 
> ServerShutdownHandling task assumes that the AssignmentManager will assign 
> such regions as part of handling the RIT nodes:
> {noformat}
> 2015-03-06 14:09:35,210 INFO 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Skip assigning 
> region in transition on other server{0d6cf37c18c54c6f4744750c6a7be837
> state=CLOSED, ts=1425668971527, 
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
> 9. At some point in the future, when the server 
> dnj1-bcpc-r3n2.example.com,60020,1425603618259 dies, the 
> ServerShutdownHandling for it gets queued up (from the log 
> hbase-hbase-master-dnj1-bcpc-r3n1.log):
> {noformat}
> 2015-03-09 11:35:10,607 INFO 
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral 
> node deleted,
> processing expiration [dnj1-bcpc-r3n2.example.com,60020,1425603618259]
> {noformat}
> 10. In RegionStates.java:serverOffline, there is a check on the region's 
> state. Since the region is in CLOSED state, the following warning is logged:
> {noformat}
> 2015-03-09 11:35:15,711 WARN org.apache.hadoop.hbase.master.RegionStates: 
> THIS SHOULD NOT HAPPEN: unexpected {0d6cf37c18c54c6f4744750c6a7be837 
> state=CLOSED, ts=1425668971527, 
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
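
The mutual deferral in steps 3 through 8 can be replayed with a small, self-contained model. This is only a sketch under stated assumptions: the class, field, and method names below are hypothetical and are not HBase APIs, and the two skip conditions are inferred from the "Skip assigning" log messages quoted above rather than taken from the actual AssignmentManager or ServerShutdownHandler code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the race described above; not HBase code.
public class UnassignedRegionSketch {

    /** Simplified region record; fields mirror what the log lines report. */
    static final class Region {
        final String name;
        final String hostInMeta;        // server recorded in meta (step 2)
        final String transitionServer;  // server named in the RIT state (step 5), or null
        boolean assigned;

        Region(String name, String hostInMeta, String transitionServer) {
            this.name = name;
            this.hostInMeta = hostInMeta;
            this.transitionServer = transitionServer;
        }
    }

    /** Servers known to be dead whose shutdown handling has not run yet (step 3). */
    private final Set<String> deadNotProcessed = new HashSet<>();
    final List<String> log = new ArrayList<>();

    void markDead(String server) {
        deadNotProcessed.add(server);
    }

    /** Step 6: the AM defers to shutdown handling if the host is dead but unprocessed. */
    boolean amAssign(Region r) {
        if (deadNotProcessed.contains(r.hostInMeta)) {
            log.add("AM: skip " + r.name + ", host is dead but not processed yet");
            return false;
        }
        r.assigned = true;
        return true;
    }

    /** Steps 7-8: shutdown handling defers to the AM for regions in transition elsewhere. */
    int shutdownHandler(String deadServer, List<Region> carried) {
        int reassigned = 0;
        for (Region r : carried) {
            if (r.transitionServer != null && !r.transitionServer.equals(deadServer)) {
                log.add("SSH: skip " + r.name + ", in transition on other server");
                continue;
            }
            r.assigned = true;
            reassigned++;
        }
        deadNotProcessed.remove(deadServer);
        return reassigned;
    }

    /** Replays steps 3-8 for the one region and reports whether it ended up assigned. */
    static boolean replay() {
        String r3n8 = "dnj1-bcpc-r3n8.example.com,60020,1425598187703";
        String r3n2 = "dnj1-bcpc-r3n2.example.com,60020,1425603618259";
        UnassignedRegionSketch master = new UnassignedRegionSketch();
        Region region = new Region("0d6cf37c18c54c6f4744750c6a7be837", r3n8, r3n2);

        master.markDead(r3n8);                                            // step 3
        master.amAssign(region);                                          // step 6: AM skips
        master.shutdownHandler(r3n8, Collections.singletonList(region));  // steps 7-8: SSH skips
        return region.assigned;
    }

    public static void main(String[] args) {
        System.out.println("assigned = " + replay());
    }
}
```

Running main prints "assigned = false": each side skips the region on the assumption that the other will pick it up, so nothing assigns it. In this model, either check alone would be harmless; the region is stranded only when both apply to the same region.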



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
