[jira] [Updated] (AMBARI-9467) RU Hacks and Technical Debt - Namenode order of active/standby in code is flipped

Alejandro Fernandez (JIRA) Tue, 03 Feb 2015 18:50:17 -0800

     [ 
https://issues.apache.org/jira/browse/AMBARI-9467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alejandro Fernandez updated AMBARI-9467:
----------------------------------------
    Attachment: AMBARI-9467.patch

> RU Hacks and Technical Debt - Namenode order of active/standby in code is 
> flipped
> ---------------------------------------------------------------------------------
>
>                 Key: AMBARI-9467
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9467
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.0.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 2.0.0
>
>         Attachments: AMBARI-9467.patch
>
>
> Search for comments like # TODO, // TODO, # HACK, or // HACK
> UpgradeHelper somehow calls the active Namenode first, but this ends up being 
> the standby namenode by the time it gets called; investigate why.
> It appears that restarting JN causes the namenodes to flip roles between 
> active and standby
> Journal node restart:
> {code}
> 2015-02-04 01:40:21,682 - Executing Rolling Upgrade post-restart
> 2015-02-04 01:40:21,684 - Ensuring Journalnode quorum is established
> 2015-02-04 01:40:21,692 - Namenode HA State: {
> IDs: nn1, nn2
> Addresses: {u'nn2': (u'c6402.ambari.apache.org:50070', 
> u'c6402.ambari.apache.org:50470'), u'nn1': (u'c6401.ambari.apache.org:50070', 
> u'c6401.ambari.apache.org:50470')}
> States: {u'standby': set([u'c6401.ambari.apache.org']), u'active': 
> set([u'c6402.ambari.apache.org'])}
> Encrypted: False
> Healthy: True
> }
> 2015-02-04 01:40:26,697 - u"Execute['hdfs dfsadmin -rollEdits']" {'tries': 1, 
> 'user': 'hdfs'}
> 2015-02-04 01:40:41,490 - Could not get namenode state for nn2
> 2015-02-04 01:40:41,492 - Namenode HA State: {
> IDs: nn1, nn2
> Addresses: {u'nn2': (u'c6402.ambari.apache.org:50070', 
> u'c6402.ambari.apache.org:50470'), u'nn1': (u'c6401.ambari.apache.org:50070', 
> u'c6401.ambari.apache.org:50470')}
> States: {u'active': set([u'c6401.ambari.apache.org'])}
> Encrypted: False
> Healthy: False
> }
> {code}
> At around 2015-02-04 01:40:26, the Namenode logs on nn2 show,
> {code}
> 2015-02-04 01:40:13,217 INFO  blockmanagement.CacheReplicationMonitor 
> (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 
> block(s) in 0 millisecond(s).
> 2015-02-04 01:40:29,454 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:rollEditLog(6345)) - Roll Edit Log from 192.168.64.103
> 2015-02-04 01:40:29,454 INFO  namenode.FSEditLog 
> (FSEditLog.java:rollEditLog(1157)) - Rolling edit logs
> 2015-02-04 01:40:29,454 INFO  namenode.FSEditLog 
> (FSEditLog.java:endCurrentLogSegment(1214)) - Ending log segment 34
> 2015-02-04 01:40:29,454 INFO  namenode.FSEditLog 
> (FSEditLog.java:printStatistics(691)) - Number of transactions: 2 Total time 
> for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of 
> syncs: 1 SyncTimes(ms): 8 4
> 2015-02-04 01:40:29,465 WARN  client.QuorumJournalManager 
> (IPCLoggerChannel.java:call(388)) - Remote journal 192.168.64.101:8485 failed 
> to write txns 35-35. Will try to write to this JN again after the next log 
> roll.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 3 is 
> not the current writer epoch  2
>         at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:445)
>         at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
>         at 
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
>         at 
> org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
>         at 
> org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy12.journal(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> So the namenodes flipped order.
> Also, nn1 shows,
> {code}
> 2015-02-04 01:39:06,500 INFO  blockmanagement.BlockManager 
> (BlockManager.java:processReport(1815)) - BLOCK* processReport: Received 
> first block report from 
> DatanodeStorage[DS-9188ccbc-f03f-48f3-9314-ca6475d846f8,DISK,NORMAL] after 
> starting up or becoming active. Its block contents are no longer considered 
> stale
> 2015-02-04 01:39:06,500 INFO  BlockStateChange 
> (BlockManager.java:processReport(1831)) - BLOCK* processReport: from storage 
> DS-9188ccbc-f03f-48f3-9314-ca6475d846f8 node 
> DatanodeRegistration(192.168.64.103, 
> datanodeUuid=54053f93-435e-4fce-950f-00d4deff1506, infoPort=50075, 
> ipcPort=8010, 
> storageInfo=lv=-56;cid=CID-373d38e6-6e88-486e-a88e-db76492ae751;nsid=1321340415;c=0),
>  blocks: 1, hasStaleStorages: false, processing time: 16 msecs
> 2015-02-04 01:39:20,978 WARN  namenode.FSNamesystem 
> (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned 
> error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:20,978 WARN  namenode.FSNamesystem 
> (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned 
> error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:49,635 WARN  namenode.FSNamesystem 
> (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned 
> error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:49,655 WARN  namenode.FSNamesystem 
> (FSNamesystem.java:getCorruptFiles(7875)) - Get corrupt file blocks returned 
> error: Operation category READ is not supported in state standby
> 2015-02-04 01:39:51,476 INFO  blockmanagement.BlockManager 
> (BlockManager.java:processReport(1815)) - BLOCK* processReport: Received 
> first block report from 
> DatanodeStorage[DS-ca53d96a-7630-475f-a502-47b59aba2ee0,DISK,NORMAL] after 
> starting up or becoming active. Its block contents are no longer considered 
> stale
> 2015-02-04 01:39:51,477 INFO  BlockStateChange 
> (BlockManager.java:processReport(1831)) - BLOCK* processReport: from storage 
> DS-ca53d96a-7630-475f-a502-47b59aba2ee0 node 
> DatanodeRegistration(192.168.64.101, 
> datanodeUuid=faff4a42-fbe3-48d1-9592-4ef1f4fe077d, infoPort=50075, 
> ipcPort=8010, 
> storageInfo=lv=-56;cid=CID-373d38e6-6e88-486e-a88e-db76492ae751;nsid=1321340415;c=0),
>  blocks: 1, hasStaleStorages: false, processing time: 1 msecs
> 2015-02-04 01:40:18,478 INFO  blockmanagement.BlockManager 
> (BlockManager.java:processReport(1815)) - BLOCK* processReport: Received 
> first block report from 
> DatanodeStorage[DS-86763f94-9f8c-41ac-be47-4f4b84b32077,DISK,NORMAL] after 
> starting up or becoming active. Its block contents are no longer considered 
> stale
> 2015-02-04 01:40:18,480 INFO  BlockStateChange 
> (BlockManager.java:processReport(1831)) - BLOCK* processReport: from storage 
> DS-86763f94-9f8c-41ac-be47-4f4b84b32077 node 
> DatanodeRegistration(192.168.64.102, 
> datanodeUuid=dda574c3-f296-4b95-ab56-5df231179356, infoPort=50075, 
> ipcPort=8010, 
> storageInfo=lv=-56;cid=CID-373d38e6-6e88-486e-a88e-db76492ae751;nsid=1321340415;c=0),
>  blocks: 1, hasStaleStorages: false, processing time: 0 msecs
> 2015-02-04 01:40:29,389 INFO  ipc.Server (Server.java:run(2053)) - IPC Server 
> handler 88 on 8020, call 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.rollEdits from 
> 192.168.64.103:45722 Call#0 Retry#0: org.apache.hadoop.ipc.StandbyException: 
> Operation category JOURNAL is not supported in state standby
> 2015-02-04 01:40:30,019 INFO  ipc.Server (Server.java:run(2053)) - IPC Server 
> handler 78 on 8020, call 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.rollEdits from 
> 192.168.64.103:45722 Call#0 Retry#2: org.apache.hadoop.ipc.StandbyException: 
> Operation category JOURNAL is not supported in state standby
> 2015-02-04 01:40:31,382 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:stopStandbyServices(1336)) - Stopping services started for 
> standby state
> 2015-02-04 01:40:31,382 WARN  ha.EditLogTailer 
> (EditLogTailer.java:doWork(339)) - Edit log tailer interrupted
> java.lang.InterruptedException: sleep interrupted
>         at java.lang.Thread.sleep(Native Method)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:337)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
> 2015-02-04 01:40:31,387 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:startActiveServices(1142)) - Starting services required 
> for active state
> {code}
> The ZKFC logs show how the unexpected failover happened,
> On nn1, which transitioned from standby to active:
> {code}
> 2015-02-04 01:39:45,173 INFO  ha.ZKFailoverController 
> (ZKFailoverController.java:becomeStandby(482)) - ZK Election indicated that 
> NameNode at c6401.ambari.apache.org/192.168.64.101:8020 should become standby
> 2015-02-04 01:39:45,175 INFO  ha.ZKFailoverController 
> (ZKFailoverController.java:becomeStandby(487)) - Successfully transitioned 
> NameNode at c6401.ambari.apache.org/192.168.64.101:8020 to standby state
> 2015-02-04 01:40:30,304 INFO  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:fenceOldActive(877)) - Checking for any old active 
> which needs to be fenced...
> 2015-02-04 01:40:30,308 INFO  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:fenceOldActive(898)) - Old node exists: 
> 0a02686112036e6e321a1763363430322e616d626172692e6170616368652e6f726720d43e28d33e
> 2015-02-04 01:40:30,310 INFO  ha.ZKFailoverController 
> (ZKFailoverController.java:doFence(511)) - Should fence: NameNode at 
> c6402.ambari.apache.org/192.168.64.102:8020
> 2015-02-04 01:40:31,312 INFO  ipc.Client 
> (Client.java:handleConnectionFailure(858)) - Retrying connect to server: 
> c6402.ambari.apache.org/192.168.64.102:8020. Already tried 0 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 
> MILLISECONDS)
> 2015-02-04 01:40:31,314 WARN  ha.FailoverController 
> (FailoverController.java:tryGracefulFence(178)) - Unable to gracefully make 
> NameNode at c6402.ambari.apache.org/192.168.64.102:8020 standby (unable to 
> connect)
> java.net.ConnectException: Call From c6401.ambari.apache.org/192.168.64.101 
> to c6402.ambari.apache.org:8020 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>         at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy12.transitionToStandby(Unknown Source)
>         at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>         at 
> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>         at 
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:512)
>         at 
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:503)
>         at 
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>         at 
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:890)
>         at 
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:902)
>         at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:801)
>         at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
>         at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>         at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>         ... 14 more
> 2015-02-04 01:40:31,319 INFO  ha.NodeFencer (NodeFencer.java:fence(91)) - 
> ====== Beginning Service Fencing Process... ======
> 2015-02-04 01:40:31,319 INFO  ha.NodeFencer (NodeFencer.java:fence(94)) - 
> Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
> 2015-02-04 01:40:31,357 INFO  ha.ShellCommandFencer 
> (ShellCommandFencer.java:tryFence(99)) - Launched fencing command '/bin/true' 
> with pid 14045
> 2015-02-04 01:40:31,364 INFO  ha.NodeFencer (NodeFencer.java:fence(98)) - 
> ====== Fencing successful by method 
> org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
> 2015-02-04 01:40:31,364 INFO  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:writeBreadCrumbNode(824)) - Writing znode 
> /hadoop-ha/ha/ActiveBreadCrumb to indicate that the local node is the most 
> recent active...
> 2015-02-04 01:40:31,369 INFO  ha.ZKFailoverController 
> (ZKFailoverController.java:becomeActive(378)) - Trying to make NameNode at 
> c6401.ambari.apache.org/192.168.64.101:8020 active...
> 2015-02-04 01:40:32,821 INFO  ha.ZKFailoverController 
> (ZKFailoverController.java:becomeActive(385)) - Successfully transitioned 
> NameNode at c6401.ambari.apache.org/192.168.64.101:8020 to active state
> {code}
> On nn2, which transitioned from active to standby:
> {code}
> 2015-02-04 01:39:46,270 INFO  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected.
> 2015-02-04 01:39:46,279 INFO  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:fenceOldActive(877)) - Checking for any old active 
> which needs to be fenced...
> 2015-02-04 01:39:46,282 INFO  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:fenceOldActive(898)) - Old node exists: 
> 0a02686112036e6e321a1763363430322e616d626172692e6170616368652e6f726720d43e28d33e
> 2015-02-04 01:39:46,282 INFO  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:fenceOldActive(900)) - But old node has our own 
> data, so don't need to fence it.
> 2015-02-04 01:39:46,282 INFO  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:writeBreadCrumbNode(824)) - Writing znode 
> /hadoop-ha/ha/ActiveBreadCrumb to indicate that the local node is the most 
> recent active...
> 2015-02-04 01:39:46,289 INFO  ha.ZKFailoverController 
> (ZKFailoverController.java:becomeActive(378)) - Trying to make NameNode at 
> c6402.ambari.apache.org/192.168.64.102:8020 active...
> 2015-02-04 01:39:46,301 INFO  ha.ZKFailoverController 
> (ZKFailoverController.java:becomeActive(385)) - Successfully transitioned 
> NameNode at c6402.ambari.apache.org/192.168.64.102:8020 to active state
> 2015-02-04 01:40:30,302 WARN  ha.HealthMonitor 
> (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying 
> to monitor health of NameNode at c6402.ambari.apache.org/192.168.64.102:8020: 
> End of File Exception between local host is: 
> "c6402.ambari.apache.org/192.168.64.102"; destination host is: 
> "c6402.ambari.apache.org":8020; : java.io.EOFException; For more details see: 
>  http://wiki.apache.org/hadoop/EOFException
> 2015-02-04 01:40:30,302 INFO  ha.HealthMonitor 
> (HealthMonitor.java:enterState(238)) - Entering state SERVICE_NOT_RESPONDING
> 2015-02-04 01:40:30,302 INFO  ha.ZKFailoverController 
> (ZKFailoverController.java:setLastHealthState(850)) - Local service NameNode 
> at c6402.ambari.apache.org/192.168.64.102:8020 entered state: 
> SERVICE_NOT_RESPONDING
> 2015-02-04 01:40:30,302 INFO  ha.ZKFailoverController 
> (ZKFailoverController.java:recheckElectability(766)) - Quitting master 
> election for NameNode at c6402.ambari.apache.org/192.168.64.102:8020 and 
> marking that fencing is necessary
> 2015-02-04 01:40:30,303 INFO  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:quitElection(354)) - Yielding from election
> 2015-02-04 01:40:30,310 INFO  zookeeper.ZooKeeper (ZooKeeper.java:close(684)) 
> - Session: 0x24b523c16a70000 closed
> 2015-02-04 01:40:30,310 WARN  ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:isStaleClient(1006)) - Ignoring stale result from 
> old client with sessionId 0x24b523c16a70000
> 2015-02-04 01:40:30,310 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(512)) 
> - EventThread shut down
> 2015-02-04 01:40:32,314 INFO  ipc.Client 
> (Client.java:handleConnectionFailure(858)) - Retrying connect to server: 
> c6402.ambari.apache.org/192.168.64.102:8020. Already tried 0 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 
> MILLISECONDS)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (AMBARI-9467) RU Hacks and Technical Debt - Namenode order of active/standby in code is flipped

Reply via email to