Rafal Wojdyla created HDFS-6179:
-----------------------------------
Summary: Synchronized
Key: HDFS-6179
URL: https://issues.apache.org/jira/browse/HDFS-6179
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode, namenode
Affects Versions: 2.2.0
Reporter: Rafal Wojdyla
Scenario:
* 600 active DNs
* 1 *active* NN
* HA configuration
When we start the SbNN, the huge number of blocks and the relatively small
initialDelay cause the SbNN to go through multiple stop-the-world garbage
collection pauses during startup (lasting minutes; the Namenode heap size is
75GB). We've observed that this SbNN slowness affects the active NN: the
active NN starts losing DNs (DNs are considered dead due to lack of
heartbeats). We assume that some DNs are hanging.
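For context, a rough back-of-the-envelope calculation assuming the default heartbeat settings (not verified against this cluster's configuration): the active NN only declares a DN dead after roughly 10.5 minutes without heartbeats, so the DN's heartbeat thread has to stay blocked at least that long.
{noformat}
// Dead-node threshold as computed on the NameNode side (DatanodeManager), defaults assumed:
//   dfs.heartbeat.interval                  = 3 s
//   dfs.namenode.heartbeat.recheck-interval = 300 000 ms
long heartbeatExpireInterval = 2 * 300000L + 10 * 1000L * 3;  // = 630 000 ms ~= 10.5 minutes
{noformat}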
When a DN is considered dead by the active Namenode, we've observed a "dead lock"
in the DN process; part of the stack trace:
{noformat}
"DataNode: [file:/disk1,file:/disk2] heartbeating to standbynamenode.net/10.10.10.10:8020" daemon prio=10 tid=0x00007ff429417800 nid=0x7f2a in Object.wait() [0x00007ff42122c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client.call(Client.java:1333)
        - locked <0x00000007db94e4c8> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.Client.call(Client.java:1300)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at $Proxy9.registerDatanode(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at $Proxy9.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:740)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromStandby(BPOfferService.java:603)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:506)
        - locked <0x0000000780006e08> (a org.apache.hadoop.hdfs.server.datanode.BPOfferService)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:704)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:539)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
        at java.lang.Thread.run(Thread.java:662)

"DataNode: [file:/disk1,file:/disk2] heartbeating to activenamenode.net/10.10.10.11:8020" daemon prio=10 tid=0x00007ff428a24000 nid=0x7f29 waiting for monitor entry [0x00007ff42132e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.updateActorStatesFromHeartbeat(BPOfferService.java:413)
        - waiting to lock <0x0000000780006e08> (a org.apache.hadoop.hdfs.server.datanode.BPOfferService)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:535)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
        at java.lang.Thread.run(Thread.java:662)
{noformat}
Notice that it's the same lock, due to synchronization in BPOfferService. The
problem is that the command from the standby can't be processed because the
standby Namenode is unresponsive; nevertheless the DN keeps waiting for a reply
from the SbNN, and it waits long enough to be considered dead by the active
Namenode.
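Below is a minimal sketch of the contention pattern, simplified from BPOfferService: the method names are real, but the bodies are placeholders that only reproduce the locking shape (the blocking RPC is stood in for by a sleep).
{noformat}
// Illustrative only - simplified from org.apache.hadoop.hdfs.server.datanode.BPOfferService.
class BPOfferServiceSketch {

  // Actor thread talking to the *standby* NN: while this monitor is held,
  // processCommandFromStandby() -> reRegister() issues a blocking registerDatanode
  // RPC to the standby, which does not return while the standby is stuck in GC.
  synchronized void processCommandFromActor() throws InterruptedException {
    Thread.sleep(Long.MAX_VALUE); // stands in for the RPC that never completes
  }

  // Actor thread talking to the *active* NN, called after every heartbeat: it needs
  // the same monitor, so it blocks behind the standby actor and the active NN stops
  // receiving heartbeats from this DN.
  synchronized void updateActorStatesFromHeartbeat() {
    // update actor state from the heartbeat response
  }
}
{noformat}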
Info: if we kill the SbNN, the DN instantly reconnects to the active NN.