[
https://issues.apache.org/jira/browse/HBASE-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959848#comment-14959848
]
Hadoop QA commented on HBASE-13330:
-----------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12766862/13330-v2-branch-1.txt
against branch-1 branch at commit d5ed46bc9f9285f75d2d906ec9c120cb408827df.
ATTACHMENT ID: 12766862
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new
or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.6.1 2.7.0
2.7.1)
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:green}+1 protoc{color}. The applied patch does not increase the
total number of protoc compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any
warning messages.
{color:green}+1 checkstyle{color}. The applied patch does not increase the
total number of checkstyle errors
{color:green}+1 findbugs{color}. The patch does not introduce any new
Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines
longer than 100
{color:green}+1 site{color}. The mvn post-site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
{color:red}-1 core zombie tests{color}. There are 1 zombie test(s):
at org.apache.hadoop.hbase.client.TestShell.testRunShellTests(TestShell.java:35)
Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/16032//testReport/
Release Findbugs (version 2.0.3) warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/16032//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors:
https://builds.apache.org/job/PreCommit-HBASE-Build/16032//artifact/patchprocess/checkstyle-aggregate.html
Console output:
https://builds.apache.org/job/PreCommit-HBASE-Build/16032//console
This message is automatically generated.
> Region left unassigned due to AM & SSH each thinking the assignment would be
> done by the other
> ----------------------------------------------------------------------------------------------
>
> Key: HBASE-13330
> URL: https://issues.apache.org/jira/browse/HBASE-13330
> Project: HBase
> Issue Type: Bug
> Components: master, Region Assignment
> Affects Versions: 1.0.0
> Reporter: Devaraj Das
> Assignee: Devaraj Das
> Fix For: 2.0.0, 1.3.0, 1.2.1, 1.0.3, 1.1.4
>
> Attachments: 13330-branch-1.txt, 13330-v2-branch-1.txt,
> 13330-v3-branch-1.txt
>
>
> Here is what I found during analysis of an issue. Raising this jira and a fix
> will follow.
> The TL;DR of this is that the AssignmentManager thinks the
> ServerShutdownHandler would assign the region and the ServerShutdownHandler
> thinks that the AssignmentManager would assign the region. The region
> (0d6cf37c18c54c6f4744750c6a7be837) ultimately never gets assigned. Below is
> an analysis from the logs that captures the flow of events.
> 1. The AssignmentManager had initially assigned this region to
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> 2. When the master restarted, it scanned meta to learn about the
> regions in the cluster. The meta record showed this region as assigned to
> dnj1-bcpc-r3n8.example.com,60020,1425598187703.
> 3. However, this server (dnj1-bcpc-r3n8.example.com,60020,1425598187703) was
> no longer alive. So, the AssignmentManager queued up a
> ServerShutdownHandling task for it, which executes asynchronously:
> {noformat}
> 2015-03-06 14:09:31,355 DEBUG org.apache.hadoop.hbase.master.ServerManager:
> Added=dnj1-bcpc-r3n8.example.com,60020,1425598187703 to dead servers,
> submitted shutdown handler to be executed meta=false
> {noformat}
> 4. The AssignmentManager proceeded to read the RIT nodes from ZK. It found
> this region as well:
> {noformat}
> 2015-03-06 14:09:31,527 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Processing
> 0d6cf37c18c54c6f4744750c6a7be837
> in state: RS_ZK_REGION_FAILED_OPEN
> {noformat}
> 5. The region was moved to CLOSED state:
> {noformat}
> 2015-03-06 14:09:31,527 WARN org.apache.hadoop.hbase.master.RegionStates:
> 0d6cf37c18c54c6f4744750c6a7be837 moved to CLOSED on
> dnj1-bcpc-r3n2.example.com,60020,1425603618259, expected
> dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> Note the reference to dnj1-bcpc-r3n2.example.com,60020,1425603618259. This
> means that the region was assigned to
> dnj1-bcpc-r3n2.example.com,60020,1425603618259, but that regionserver couldn't
> open the region for some reason, and it changed the state to
> RS_ZK_REGION_FAILED_OPEN in the RIT znode on ZK.
> 6. After that, the AssignmentManager tried to assign it again. However, the
> assignment didn't happen because the ServerShutdownHandling task queued
> earlier had not yet executed:
> {noformat}
> 2015-03-06 14:09:31,527 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning
> phMonthlyVersion,\x89\x80\x00\x00,1423149098980.0d6cf37c18c54c6f4744750c6a7be837.,
> it's host dnj1-bcpc-r3n8.example.com,60020,1425598187703 is dead but not
> processed yet
> {noformat}
> 7. Eventually the ServerShutdownHandling task executed.
> {noformat}
> 2015-03-06 14:09:35,188 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs
> for dnj1-bcpc-r3n8.example.com,60020,1425598187703 before assignment.
> 2015-03-06 14:09:35,209 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 19
> region(s) that dnj1-bcpc-r3n8.example.com,60020,1425598187703 was
> carrying (and 0 regions(s) that were opening on this server)
> 2015-03-06 14:09:35,211 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished
> processing of shutdown of dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> 8. However, the ServerShutdownHandling task skipped the region in question.
> This was because the region was in RIT, and the ServerShutdownHandling task
> assumed that the AssignmentManager would assign it as part of handling the RIT
> nodes:
> {noformat}
> 2015-03-06 14:09:35,210 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Skip assigning
> region in transition on other server{0d6cf37c18c54c6f4744750c6a7be837
> state=CLOSED, ts=1425668971527,
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
> 9. At some point in the future, when the server
> dnj1-bcpc-r3n2.example.com,60020,1425603618259 dies, the
> ServerShutdownHandling for it gets queued up (from the log
> hbase-hbase-master-dnj1-bcpc-r3n1.log):
> {noformat}
> 2015-03-09 11:35:10,607 INFO
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral
> node deleted,
> processing expiration [dnj1-bcpc-r3n2.example.com,60020,1425603618259]
> {noformat}
> 10. In RegionStates.java:serverOffline, there is a check on the region's
> state. Since the region is in CLOSED state, the following warning is
> logged:
> {noformat}
> 2015-03-09 11:35:15,711 WARN org.apache.hadoop.hbase.master.RegionStates:
> THIS SHOULD NOT HAPPEN: unexpected {0d6cf37c18c54c6f4744750c6a7be837
> state=CLOSED, ts=1425668971527,
> server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
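
The hand-off in steps 3 through 8 of the quoted description can be replayed with a small toy model. This is only a sketch, not HBase code: the class, fields, and method names are hypothetical, the server names are shortened, and the two skip conditions are simplifications inferred from the log messages quoted above rather than the actual checks in AssignmentManager or ServerShutdownHandler.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy replay of the AM/SSH hand-off; all names here are hypothetical. */
public class AssignmentRaceSketch {
    static final String DEAD_HOST = "r3n8";   // host recorded in meta, dead at master restart
    static final String OTHER_HOST = "r3n2";  // host that reported FAILED_OPEN in the RIT znode

    final Set<String> deadNotProcessed = new HashSet<>();
    final List<String> log = new ArrayList<>();
    String ritServer = null;      // server recorded in the region's in-transition state
    boolean inTransition = false;
    boolean assigned = false;

    /** AM side (steps 4-6): process the RIT znode, then try to assign. */
    void assignmentManagerProcessRit(String metaHost, String ritHost) {
        inTransition = true;
        ritServer = ritHost;      // region moved to CLOSED on ritHost
        if (deadNotProcessed.contains(metaHost)) {
            // Assumed condition: AM defers to the pending shutdown handler.
            log.add("AM: skip, host " + metaHost + " is dead but not processed yet");
            return;
        }
        assigned = true;
        inTransition = false;
        log.add("AM: assigned");
    }

    /** SSH side (steps 7-8): reassign regions of the dead server. */
    void serverShutdownHandler(String deadServer) {
        deadNotProcessed.remove(deadServer);
        if (inTransition && !deadServer.equals(ritServer)) {
            // Assumed condition: SSH defers to the AM's RIT handling.
            log.add("SSH: skip, region in transition on other server " + ritServer);
            return;
        }
        assigned = true;
        inTransition = false;
        log.add("SSH: assigned");
    }

    /** Replays the order of events from the quoted description. */
    static AssignmentRaceSketch replay() {
        AssignmentRaceSketch s = new AssignmentRaceSketch();
        s.deadNotProcessed.add(DEAD_HOST);                     // step 3: SSH queued, not yet run
        s.assignmentManagerProcessRit(DEAD_HOST, OTHER_HOST);  // steps 4-6
        s.serverShutdownHandler(DEAD_HOST);                    // steps 7-8
        return s;
    }

    public static void main(String[] args) {
        AssignmentRaceSketch s = replay();
        s.log.forEach(System.out::println);
        System.out.println("assigned = " + s.assigned);
    }
}
```

In this model both sides record a skip and the region ends with `assigned = false`, which mirrors the outcome reported for region 0d6cf37c18c54c6f4744750c6a7be837.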