[ 
https://issues.apache.org/jira/browse/HDFS-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated HDFS-6179:
--------------------------------

    Summary: Synchronized BPOfferService - datanode locks for slow namenode reply.  (was: Synchronized )

> Synchronized BPOfferService - datanode locks for slow namenode reply.
> ---------------------------------------------------------------------
>
>                 Key: HDFS-6179
>                 URL: https://issues.apache.org/jira/browse/HDFS-6179
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.2.0
>            Reporter: Rafal Wojdyla
>
> Scenario:
> * 600 active DNs
> * 1 *active* NN
> * HA configuration
> When we start the SbNN, the huge number of blocks combined with the relatively 
> small initialDelay means that during startup the SbNN goes through multiple 
> stop-the-world garbage collection pauses lasting minutes (the Namenode heap 
> size is 75GB). We've observed that this SbNN slowness affects the active NN: 
> the active NN starts losing DNs (DNs are considered dead due to lack of 
> heartbeats). We assume that some DNs are hanging.
> When a DN is considered dead by the active Namenode, we've observed a 
> "deadlock" in the DN process; part of the stack trace:
> {noformat}
> "DataNode: [file:/disk1,file:/disk2]  heartbeating to 
> standbynamenode.net/10.10.10.10:8020" daemon prio=10 tid=0x00007ff429417800 
> nid=0x7f2a in Object.wait() [0x00007ff42122c000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1333)
>         - locked <0x00000007db94e4c8> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>         at $Proxy9.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at $Proxy9.registerDatanode(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:740)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromStandby(BPOfferService.java:603)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:506)
>         - locked <0x0000000780006e08> (a 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:704)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:539)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
>         at java.lang.Thread.run(Thread.java:662)
> "DataNode: [file:/disk1,file:/disk2]  heartbeating to 
> activenamenode.net/10.10.10.11:8020" daemon prio=10 tid=0x00007ff428a24000 
> nid=0x7f29 waiting for monitor entry [0x00007ff42132e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.updateActorStatesFromHeartbeat(BPOfferService.java:413)
>         - waiting to lock <0x0000000780006e08> (a 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:535)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
>         at java.lang.Thread.run(Thread.java:662)
> {noformat}
> Notice that both threads contend on the same lock, due to the synchronization 
> in BPOfferService. The problem is that the command from the standby cannot be 
> processed because the standby Namenode is unresponsive; the DN nevertheless 
> keeps waiting for a reply from the SbNN while holding the lock, and it waits 
> long enough to be considered dead by the active Namenode.
> Info: if we kill the SbNN, the DN instantly reconnects to the active NN.
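
The stalled-heartbeat pattern in the traces above can be reproduced outside HDFS. 
Below is a minimal, self-contained sketch (hypothetical class and method names, 
not the actual BPOfferService code) of the same structure: two actor threads, one 
per namenode, synchronize on a single shared object, so a blocking call made while 
holding the monitor leaves the other thread BLOCKED.

{code:java}
// Sketch only: hypothetical names, not the HDFS sources. One thread holds the
// shared monitor while "talking to the standby"; the other thread, which needs
// the same monitor to record a heartbeat to the active NN, stays BLOCKED.
public class LockedOfferServiceSketch {

    static class SharedOfferService {
        // Held while processing a command from the standby; the sleep stands in
        // for registerDatanode() hanging against an unresponsive SbNN.
        synchronized void processCommandFromStandby() throws InterruptedException {
            Thread.sleep(60_000L);
        }

        // The active-facing actor needs the same monitor to record heartbeat
        // results, so it blocks for the full duration of the call above.
        synchronized void updateActorStatesFromHeartbeat() {
            System.out.println("heartbeat state updated");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SharedOfferService bpos = new SharedOfferService();

        Thread standbyActor = new Thread(() -> {
            try {
                bpos.processCommandFromStandby();
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }, "actor-to-standby");

        Thread activeActor = new Thread(
                bpos::updateActorStatesFromHeartbeat, "actor-to-active");

        standbyActor.start();
        Thread.sleep(1000); // let the standby actor grab the monitor first
        activeActor.start(); // stays BLOCKED until the standby actor returns
    }
}
{code}

Running the sketch and taking a thread dump (jstack <pid>) shows one thread 
sleeping inside the monitor and the other BLOCKED on it, mirroring the pair of 
datanode actor threads in the report.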



--
This message was sent by Atlassian JIRA
(v6.2#6252)
