[
https://issues.apache.org/jira/browse/HBASE-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998842#comment-14998842
]
Hadoop QA commented on HBASE-14498:
-----------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12771547/HBASE-14498-V3.patch
against master branch at commit 112900d0425a8157b89041f0e353ebf5cc259c69.
ATTACHMENT ID: 12771547
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new
or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.6.1 2.7.0
2.7.1)
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:green}+1 protoc{color}. The applied patch does not increase the
total number of protoc compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any
warning messages.
{color:green}+1 checkstyle{color}. The applied patch does not increase the
total number of checkstyle errors
{color:green}+1 findbugs{color}. The patch does not introduce any new
Findbugs (version 2.0.3) warnings.
{color:red}-1 release audit{color}. The applied patch generated 1 release
audit warnings (more than the master's current 0 warnings).
{color:green}+1 lineLengths{color}. The patch does not introduce lines
longer than 100
{color:green}+1 site{color}. The mvn post-site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/16478//testReport/
Release audit warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/16478//artifact/patchprocess/patchReleaseAuditWarnings.txt
Release Findbugs (version 2.0.3) warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/16478//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors:
https://builds.apache.org/job/PreCommit-HBASE-Build/16478//artifact/patchprocess/checkstyle-aggregate.html
Console output:
https://builds.apache.org/job/PreCommit-HBASE-Build/16478//console
This message is automatically generated.
> Master stuck in infinite loop when all Zookeeper servers are unreachable.
> -------------------------------------------------------------------------
>
> Key: HBASE-14498
> URL: https://issues.apache.org/jira/browse/HBASE-14498
> Project: HBase
> Issue Type: Bug
> Components: master
> Reporter: Y. SREENIVASULU REDDY
> Assignee: Pankaj Kumar
> Priority: Blocker
> Attachments: HBASE-14498-V2.patch, HBASE-14498-V3.patch,
> HBASE-14498.patch
>
>
> We hit a strange scenario in our production environment.
> In an HA cluster,
> > The active master (HM1) was not able to connect to any ZooKeeper server (due
> > to a network breakdown between the master machine and the ZooKeeper servers).
> {code}
> 2015-09-26 15:24:47,508 INFO
> [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in
> 33463ms for sessionid 0x104576b8dda0002, closing socket connection and
> attempting reconnect
> 2015-09-26 15:24:47,877 INFO
> [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)]
> client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:48,236 INFO [main-SendThread(ZK-Host1:2181)]
> client.FourLetterWordMain: connecting to ZK-Host1 2181
> 2015-09-26 15:24:49,879 WARN
> [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:49,879 INFO
> [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server
> ZK-Host1/ZK-IP1:2181. Will not attempt to authenticate using SASL (unknown
> error)
> 2015-09-26 15:24:50,238 WARN [main-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
> 2015-09-26 15:24:50,238 INFO [main-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server
> ZK-Host1/ZK-Host1:2181. Will not attempt to authenticate using SASL (unknown
> error)
> 2015-09-26 15:25:17,470 INFO [main-SendThread(ZK-Host1:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server in
> 30023ms for sessionid 0x2045762cc710006, closing socket connection and
> attempting reconnect
> 2015-09-26 15:25:17,571 WARN [master/HM1-Host/HM1-IP:16000]
> zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper,
> quorum=ZK-Host:2181,ZK-Host1:2181,ZK-Host2:2181,
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /hbase/master
> 2015-09-26 15:25:17,872 INFO [main-SendThread(ZK-Host:2181)]
> client.FourLetterWordMain: connecting to ZK-Host 2181
> 2015-09-26 15:25:19,874 WARN [main-SendThread(ZK-Host:2181)]
> zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host
> 2015-09-26 15:25:19,874 INFO [main-SendThread(ZK-Host:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server ZK-Host/ZK-IP:2181.
> Will not attempt to authenticate using SASL (unknown error)
> {code}
> > Since HM1 was not able to connect to any ZooKeeper server, it was never
> > notified of a session timeout by the ZooKeeper server side, so HM1 did not
> > abort.
> > When the ZooKeeper session timed out, the standby master (HM2) registered
> > itself as the active master.
> > HM2 keeps waiting for region servers to report to it as part of active
> > master initialization.
> {noformat}
> 2015-09-26 15:24:44,928 | INFO | HM2-Host:21300.activeMasterManager | Waiting
> for region servers count to settle; currently checked in 0, slept for 0 ms,
> expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval
> of 1500 ms. |
> org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> ---
> ---
> 2015-09-26 15:32:50,841 | INFO | HM2-Host:21300.activeMasterManager | Waiting
> for region servers count to settle; currently checked in 0, slept for 483913
> ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms,
> interval of 1500 ms. |
> org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
> {noformat}
> > Meanwhile, the region servers keep reporting to HM1 at a 3-second interval.
> > A region server retrieves the master location from ZooKeeper only when it
> > cannot connect to the master (ServiceException).
> Under the current design, region servers will not report to HM2 unless HM1
> aborts, so HM2 exits (InitializationMonitor) and then waits for region
> servers again, in a loop.
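
For readers following the description above, here is a minimal Java sketch of the interaction it reports. It is a toy model, not HBase code: the classes `Master` and `RegionServer`, the `zkLookup` supplier, and the `keepWaiting` helper are all hypothetical names made up for illustration. The wait condition is a simplification inferred from the HM2 log lines (still waiting after 483913 ms despite a 4500 ms timeout because 0 servers had checked in against a minimum of 1); the actual condition in `ServerManager.waitForRegionServers` may involve more terms.

```java
import java.util.function.Supplier;

public class StaleMasterSketch {

    /** Hypothetical stand-in for a master that region servers report to. */
    static final class Master {
        final String name;
        boolean running = true;
        int reportsReceived = 0;

        Master(String name) { this.name = name; }

        void receiveReport() {
            if (!running) {
                throw new IllegalStateException(name + " is not reachable");
            }
            reportsReceived++;
        }
    }

    /** Hypothetical region server that caches the master it last resolved. */
    static final class RegionServer {
        // Stands in for reading the active master location from ZooKeeper.
        private final Supplier<Master> zkLookup;
        private Master cached;

        RegionServer(Master initial, Supplier<Master> zkLookup) {
            this.cached = initial;
            this.zkLookup = zkLookup;
        }

        /** Reports to the cached master; consults ZooKeeper only if that call fails. */
        void report() {
            try {
                cached.receiveReport();
            } catch (IllegalStateException e) {
                cached = zkLookup.get();
                cached.receiveReport();
            }
        }
    }

    /** Assumed, simplified wait condition: below the minimum, the timeout is ignored. */
    static boolean keepWaiting(int checkedIn, int minimum, long sleptMs, long timeoutMs) {
        return checkedIn < minimum || sleptMs < timeoutMs;
    }

    public static void main(String[] args) {
        Master hm1 = new Master("HM1"); // cut off from ZooKeeper, still serving RPCs
        Master hm2 = new Master("HM2"); // registered as active in ZooKeeper
        RegionServer rs = new RegionServer(hm1, () -> hm2);

        // HM1 keeps answering, so the region server never looks up HM2.
        for (int i = 0; i < 5; i++) {
            rs.report();
        }
        System.out.println("HM1 reports: " + hm1.reportsReceived
            + ", HM2 reports: " + hm2.reportsReceived);
        System.out.println("HM2 still waiting: "
            + keepWaiting(hm2.reportsReceived, 1, 483913L, 4500L));

        // Only once HM1 stops answering does the region server switch to HM2.
        hm1.running = false;
        rs.report();
        System.out.println("After HM1 aborts, HM2 reports: " + hm2.reportsReceived);
        System.out.println("HM2 still waiting: "
            + keepWaiting(hm2.reportsReceived, 1, 483913L, 4500L));
    }
}
```

In this model HM2 receives nothing for as long as HM1 answers reports, which matches the report that HM2 cannot finish initialization until HM1 aborts.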