[
https://issues.apache.org/jira/browse/AMBARI-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304693#comment-14304693
]
Hadoop QA commented on AMBARI-9467:
-----------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12696363/AMBARI-9467.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include
any new or modified tests.
Please justify why no new tests are needed for this
patch.
Also please list what manual steps were performed to
verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in
ambari-server:
org.apache.ambari.server.state.cluster.ClusterTest
Test results:
https://builds.apache.org/job/Ambari-trunk-test-patch/1573//testReport/
Console output:
https://builds.apache.org/job/Ambari-trunk-test-patch/1573//console
This message is automatically generated.
> RU Hacks and Technical Debt - Namenode order of active/standby in code is
> flipped
> ---------------------------------------------------------------------------------
>
> Key: AMBARI-9467
> URL: https://issues.apache.org/jira/browse/AMBARI-9467
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 2.0.0
> Reporter: Alejandro Fernandez
> Assignee: Alejandro Fernandez
> Fix For: 2.0.0
>
> Attachments: AMBARI-9467.patch
>
>
> Search for comments like # TODO, // TODO, # HACK, or // HACK
> UpgradeHelper somehow calls the active Namenode first, but this ends up being
> the standby namenode by the time it gets called; investigate why.
> Journal node restart:
> {code}
> 2015-02-04 01:40:21,682 - Executing Rolling Upgrade post-restart
> 2015-02-04 01:40:21,684 - Ensuring Journalnode quorum is established
> 2015-02-04 01:40:21,692 - Namenode HA State: {
> IDs: nn1, nn2
> Addresses: {u'nn2': (u'c6402.ambari.apache.org:50070',
> u'c6402.ambari.apache.org:50470'), u'nn1': (u'c6401.ambari.apache.org:50070',
> u'c6401.ambari.apache.org:50470')}
> States: {u'standby': set([u'c6401.ambari.apache.org']), u'active':
> set([u'c6402.ambari.apache.org'])}
> Encrypted: False
> Healthy: True
> }
> 2015-02-04 01:40:26,697 - u"Execute['hdfs dfsadmin -rollEdits']" {'tries': 1,
> 'user': 'hdfs'}
> 2015-02-04 01:40:41,490 - Could not get namenode state for nn2
> 2015-02-04 01:40:41,492 - Namenode HA State: {
> IDs: nn1, nn2
> Addresses: {u'nn2': (u'c6402.ambari.apache.org:50070',
> u'c6402.ambari.apache.org:50470'), u'nn1': (u'c6401.ambari.apache.org:50070',
> u'c6401.ambari.apache.org:50470')}
> States: {u'active': set([u'c6401.ambari.apache.org'])}
> Encrypted: False
> Healthy: False
> }
> {code}
> At around 2015-02-04 01:40:26, the Namenode logs on nn2 show,
> {code}
> 2015-02-04 01:40:13,217 INFO blockmanagement.CacheReplicationMonitor
> (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0
> block(s) in 0 millisecond(s).
> 2015-02-04 01:40:29,454 INFO namenode.FSNamesystem
> (FSNamesystem.java:rollEditLog(6345)) - Roll Edit Log from 192.168.64.103
> 2015-02-04 01:40:29,454 INFO namenode.FSEditLog
> (FSEditLog.java:rollEditLog(1157)) - Rolling edit logs
> 2015-02-04 01:40:29,454 INFO namenode.FSEditLog
> (FSEditLog.java:endCurrentLogSegment(1214)) - Ending log segment 34
> 2015-02-04 01:40:29,454 INFO namenode.FSEditLog
> (FSEditLog.java:printStatistics(691)) - Number of transactions: 2 Total time
> for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of
> syncs: 1 SyncTimes(ms): 8 4
> 2015-02-04 01:40:29,465 WARN client.QuorumJournalManager
> (IPCLoggerChannel.java:call(388)) - Remote journal 192.168.64.101:8485 failed
> to write txns 35-35. Will try to write to this JN again after the next log
> roll.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 3 is
> not the current writer epoch 2
> at
> org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:445)
> at
> org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
> at
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
> at
> org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
> at
> org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy12.journal(Unknown Source)
> at
> org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
> at
> org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
> at
> org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> So the namenodes flipped order.
> Also, nn1 shows,
> {code}
> 2015-02-04 01:39:06,500 INFO blockmanagement.BlockManager
> (BlockManager.java:processReport(1815)) - BLOCK* processReport: Received
> first block report from
> DatanodeStorage[DS-9188ccbc-f03f-48f3-9314-ca6475d846f8,DISK,NORMAL] after
> starting up or becoming active. Its block contents are no longer considered
> stale
> 2015-02-04 01:39:06,500 INFO BlockStateChange
> (BlockManager.java:processReport(1831)) - BLOCK* processReport: from storage
> DS-9188ccbc-f03f-48f3-9314-ca6475d846f8 node
> DatanodeRegistration(192.168.64.103,
> datanodeUuid=54053f93-435e-4fce-950f-00d4deff1506, infoPort=50075,
> ipcPort=8010,
> storageInfo=lv=-56;cid=CID-373d38e6-6e88-486e-a88e-db76492ae751;nsid=1321340415;c=0),
> blocks: 1, hasStaleStorages: false, processing time: 16 msecs
> 2015-02-04 01:39:20,978 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned
> error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:20,978 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned
> error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:49,635 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned
> error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:49,655 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned
> error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:51,476 INFO blockmanagement.BlockManager
> (BlockManager.java:processReport(1815)) - BLOCK* processReport: Received
> first block report from
> DatanodeStorage[DS-ca53d96a-7630-475f-a502-47b59aba2ee0,DISK,NORMAL] after
> starting up or becoming active. Its block contents are no longer considered
> stale
> 2015-02-04 01:39:51,477 INFO BlockStateChange
> (BlockManager.java:processReport(1831)) - BLOCK* processReport: from storage
> DS-ca53d96a-7630-475f-a502-47b59aba2ee0 node
> DatanodeRegistration(192.168.64.101,
> datanodeUuid=faff4a42-fbe3-48d1-9592-4ef1f4fe077d, infoPort=50075,
> ipcPort=8010,
> storageInfo=lv=-56;cid=CID-373d38e6-6e88-486e-a88e-db76492ae751;nsid=1321340415;c=0),
> blocks: 1, hasStaleStorages: false, processing time: 1 msecs
> 2015-02-04 01:40:18,478 INFO blockmanagement.BlockManager
> (BlockManager.java:processReport(1815)) - BLOCK* processReport: Received
> first block report from
> DatanodeStorage[DS-86763f94-9f8c-41ac-be47-4f4b84b32077,DISK,NORMAL] after
> starting up or becoming active. Its block contents are no longer considered
> stale
> 2015-02-04 01:40:18,480 INFO BlockStateChange
> (BlockManager.java:processReport(1831)) - BLOCK* processReport: from storage
> DS-86763f94-9f8c-41ac-be47-4f4b84b32077 node
> DatanodeRegistration(192.168.64.102,
> datanodeUuid=dda574c3-f296-4b95-ab56-5df231179356, infoPort=50075,
> ipcPort=8010,
> storageInfo=lv=-56;cid=CID-373d38e6-6e88-486e-a88e-db76492ae751;nsid=1321340415;c=0),
> blocks: 1, hasStaleStorages: false, processing time: 0 msecs
> 2015-02-04 01:40:29,389 INFO ipc.Server (Server.java:run(2053)) - IPC Server
> handler 88 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.rollEdits from
> 192.168.64.103:45722 Call#0 Retry#0: org.apache.hadoop.ipc.StandbyException:
> Operation category JOURNAL is not supported in state standby
> 2015-02-04 01:40:30,019 INFO ipc.Server (Server.java:run(2053)) - IPC Server
> handler 78 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.rollEdits from
> 192.168.64.103:45722 Call#0 Retry#2: org.apache.hadoop.ipc.StandbyException:
> Operation category JOURNAL is not supported in state standby
> 2015-02-04 01:40:31,382 INFO namenode.FSNamesystem
> (FSNamesystem.java:stopStandbyServices(1336)) - Stopping services started for
> standby state
> 2015-02-04 01:40:31,382 WARN ha.EditLogTailer
> (EditLogTailer.java:doWork(339)) - Edit log tailer interrupted
> java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
> 2015-02-04 01:40:31,387 INFO namenode.FSNamesystem
> (FSNamesystem.java:startActiveServices(1142)) - Starting services required
> for active state
> {code}
> The ZKFC logs show how the unexpected failover happened,
> On nn1, which transitioned from standby to active:
> {code}
> 2015-02-04 01:39:45,173 INFO ha.ZKFailoverController
> (ZKFailoverController.java:becomeStandby(482)) - ZK Election indicated that
> NameNode at c6401.ambari.apache.org/192.168.64.101:8020 should become standby
> 2015-02-04 01:39:45,175 INFO ha.ZKFailoverController
> (ZKFailoverController.java:becomeStandby(487)) - Successfully transitioned
> NameNode at c6401.ambari.apache.org/192.168.64.101:8020 to standby state
> 2015-02-04 01:40:30,304 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:fenceOldActive(877)) - Checking for any old active
> which needs to be fenced...
> 2015-02-04 01:40:30,308 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:fenceOldActive(898)) - Old node exists:
> 0a02686112036e6e321a1763363430322e616d626172692e6170616368652e6f726720d43e28d33e
> 2015-02-04 01:40:30,310 INFO ha.ZKFailoverController
> (ZKFailoverController.java:doFence(511)) - Should fence: NameNode at
> c6402.ambari.apache.org/192.168.64.102:8020
> 2015-02-04 01:40:31,312 INFO ipc.Client
> (Client.java:handleConnectionFailure(858)) - Retrying connect to server:
> c6402.ambari.apache.org/192.168.64.102:8020. Already tried 0 time(s); retry
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
> MILLISECONDS)
> 2015-02-04 01:40:31,314 WARN ha.FailoverController
> (FailoverController.java:tryGracefulFence(178)) - Unable to gracefully make
> NameNode at c6402.ambari.apache.org/192.168.64.102:8020 standby (unable to
> connect)
> java.net.ConnectException: Call From c6401.ambari.apache.org/192.168.64.101
> to c6402.ambari.apache.org:8020 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown
> Source)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy12.transitionToStandby(Unknown Source)
> at
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
> at
> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
> at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:512)
> at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:503)
> at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
> at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:890)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:902)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:801)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
> at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
> at
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
> at org.apache.hadoop.ipc.Client.call(Client.java:1438)
> ... 14 more
> 2015-02-04 01:40:31,319 INFO ha.NodeFencer (NodeFencer.java:fence(91)) -
> ====== Beginning Service Fencing Process... ======
> 2015-02-04 01:40:31,319 INFO ha.NodeFencer (NodeFencer.java:fence(94)) -
> Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
> 2015-02-04 01:40:31,357 INFO ha.ShellCommandFencer
> (ShellCommandFencer.java:tryFence(99)) - Launched fencing command '/bin/true'
> with pid 14045
> 2015-02-04 01:40:31,364 INFO ha.NodeFencer (NodeFencer.java:fence(98)) -
> ====== Fencing successful by method
> org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
> 2015-02-04 01:40:31,364 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:writeBreadCrumbNode(824)) - Writing znode
> /hadoop-ha/ha/ActiveBreadCrumb to indicate that the local node is the most
> recent active...
> 2015-02-04 01:40:31,369 INFO ha.ZKFailoverController
> (ZKFailoverController.java:becomeActive(378)) - Trying to make NameNode at
> c6401.ambari.apache.org/192.168.64.101:8020 active...
> 2015-02-04 01:40:32,821 INFO ha.ZKFailoverController
> (ZKFailoverController.java:becomeActive(385)) - Successfully transitioned
> NameNode at c6401.ambari.apache.org/192.168.64.101:8020 to active state
> {code}
> On nn2, which transitioned from active to standby:
> {code}
> 2015-02-04 01:39:46,270 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected.
> 2015-02-04 01:39:46,279 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:fenceOldActive(877)) - Checking for any old active
> which needs to be fenced...
> 2015-02-04 01:39:46,282 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:fenceOldActive(898)) - Old node exists:
> 0a02686112036e6e321a1763363430322e616d626172692e6170616368652e6f726720d43e28d33e
> 2015-02-04 01:39:46,282 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:fenceOldActive(900)) - But old node has our own
> data, so don't need to fence it.
> 2015-02-04 01:39:46,282 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:writeBreadCrumbNode(824)) - Writing znode
> /hadoop-ha/ha/ActiveBreadCrumb to indicate that the local node is the most
> recent active...
> 2015-02-04 01:39:46,289 INFO ha.ZKFailoverController
> (ZKFailoverController.java:becomeActive(378)) - Trying to make NameNode at
> c6402.ambari.apache.org/192.168.64.102:8020 active...
> 2015-02-04 01:39:46,301 INFO ha.ZKFailoverController
> (ZKFailoverController.java:becomeActive(385)) - Successfully transitioned
> NameNode at c6402.ambari.apache.org/192.168.64.102:8020 to active state
> 2015-02-04 01:40:30,302 WARN ha.HealthMonitor
> (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying
> to monitor health of NameNode at c6402.ambari.apache.org/192.168.64.102:8020:
> End of File Exception between local host is:
> "c6402.ambari.apache.org/192.168.64.102"; destination host is:
> "c6402.ambari.apache.org":8020; : java.io.EOFException; For more details see:
> http://wiki.apache.org/hadoop/EOFException
> 2015-02-04 01:40:30,302 INFO ha.HealthMonitor
> (HealthMonitor.java:enterState(238)) - Entering state SERVICE_NOT_RESPONDING
> 2015-02-04 01:40:30,302 INFO ha.ZKFailoverController
> (ZKFailoverController.java:setLastHealthState(850)) - Local service NameNode
> at c6402.ambari.apache.org/192.168.64.102:8020 entered state:
> SERVICE_NOT_RESPONDING
> 2015-02-04 01:40:30,302 INFO ha.ZKFailoverController
> (ZKFailoverController.java:recheckElectability(766)) - Quitting master
> election for NameNode at c6402.ambari.apache.org/192.168.64.102:8020 and
> marking that fencing is necessary
> 2015-02-04 01:40:30,303 INFO ha.ActiveStandbyElector
> (ActiveStandbyElector.java:quitElection(354)) - Yielding from election
> 2015-02-04 01:40:30,310 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684))
> - Session: 0x24b523c16a70000 closed
> 2015-02-04 01:40:30,310 WARN ha.ActiveStandbyElector
> (ActiveStandbyElector.java:isStaleClient(1006)) - Ignoring stale result from
> old client with sessionId 0x24b523c16a70000
> 2015-02-04 01:40:30,310 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512))
> - EventThread shut down
> 2015-02-04 01:40:32,314 INFO ipc.Client
> (Client.java:handleConnectionFailure(858)) - Retrying connect to server:
> c6402.ambari.apache.org/192.168.64.102:8020. Already tried 0 time(s); retry
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
> MILLISECONDS)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)