[ 
https://issues.apache.org/jira/browse/HDFS-15131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell updated HDFS-15131:
-------------------------------------
          Component/s: namenode
                       datanode
     Target Version/s: 3.3.0
    Affects Version/s: 3.3.0
          Description: 
We have seen some occurrences of the Namenode getting very slow on delete 
operations, to the point where IBRs (incremental block reports) get blocked 
frequently and files fail to close. On one cluster in particular, after about 
4 weeks of uptime, the Namenode started responding very poorly. Restarting it 
corrected the problem for another 4 weeks.

In that example, jstacks taken on the Namenode consistently pointed to slow 
HDFS delete calls performing an operation on the FoldedTreeSet structure. Every 
time a jstack was sampled, an operation on the FoldedTreeSet was at the top of 
the stack:

{code}
"IPC Server handler 573 on 8020" #663 daemon prio=5 os_prio=0 
tid=0x00007fe6a4087800 nid=0x97a6 runnable [0x00007fe67bdfd000]
   java.lang.Thread.State: RUNNABLE
        at 
org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:879)
        at 
org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:263)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3676)
        at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3507)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4158)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4132)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4069)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4053)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:845)
        at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:308)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:603)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
{code}

The observation in this case was that the Namenode worked fine after a restart, 
then at some point after about 4 weeks of uptime the problem started and 
persisted until the Namenode was restarted again. The problem then did not 
return for about another 4 weeks.

On a completely different cluster and version, I recently came across a problem 
where files were again failing to close (last block does not have a sufficient 
number of replicas) and the datanodes were logging a lot of messages like the 
following:

{code}
2019-11-27 09:00:49,678 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Took 21540ms to process 1 commands from NN
{code}

These messages had a range of durations and were fairly frequent. Focusing on 
the longer occurrences of around 20 seconds and checking a few of them, the 
jstacks showed a very similar problem to the NN issue, in that the 
FoldedTreeSet appears to be at the root of the problem.

The heartbeat thread is waiting on a lock:

{code}
"Thread-26" #243 daemon prio=5 os_prio=0 tid=0x00007f575225a000 nid=0x1d8bae 
waiting for monitor entry [0x00007f571ed76000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2004)
    - waiting to lock <0x00000006b79baf00> (a 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
    at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:699)
    at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:645)
    at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:730)
    at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:539)
    at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:696)
    at java.lang.Thread.run(Thread.java:748)
{code}

The thread that appears to be blocking it is:

{code}
"DataXceiver for client DFSClient_NONMAPREDUCE_672634602_52 at 
/10.71.224.51:37620 [Sending block 
BP-551684957-10.71.224.13-1505987946791:blk_1683565334_609907893]" #142018893 
daemon prio=5 os_prio=0 tid=0x00007f573cd25000 nid=0x1912a runnable 
[0x00007f570acea000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.hdfs.util.FoldedTreeSet.get(FoldedTreeSet.java:449)
    at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaMap.get(ReplicaMap.java:111)
    - locked <0x00000006b79baf00> (a 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
    at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getFile(FsDatasetImpl.java:2199)
    at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.validateBlockFile(FsDatasetImpl.java:1954)
    - locked <0x00000006b79baf00> (a 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
    at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:709)
    at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:702)
    at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getMetaDataInputStream(FsDatasetImpl.java:220)
    at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:301)
    at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:537)
    at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
    at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
    at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:748)
{code}
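
To make the interaction above concrete: the DataXceiver holds the FsDatasetImpl 
monitor while ReplicaMap.get walks the FoldedTreeSet, so the heartbeat actor's 
command processing has to wait for the whole lookup to finish, which is what 
produces the "Took ...ms to process 1 commands from NN" messages. The following 
is only a minimal, self-contained sketch of that pile-up pattern (hypothetical 
class and method names, not HDFS code):

{code}
// Hypothetical sketch (not HDFS code) of the contention pattern above: one thread
// holds a coarse dataset lock while doing a slow lookup, so the command/heartbeat
// thread blocks on the same monitor for the full duration of the lookup.
public class CoarseLockPileUp {
  private static final Object datasetLock = new Object(); // stands in for FsDatasetImpl

  // Stands in for ReplicaMap.get -> FoldedTreeSet.get under the dataset lock.
  private static void slowLookup() {
    synchronized (datasetLock) {
      try {
        Thread.sleep(20_000);                 // a ~20s get(), as seen in the jstacks
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  // Stands in for FsDatasetImpl.invalidate called from the heartbeat actor.
  private static void processCommandFromNN() {
    long start = System.currentTimeMillis();
    synchronized (datasetLock) {
      // delete the replica, etc.
    }
    System.out.println("Took " + (System.currentTimeMillis() - start)
        + "ms to process 1 commands from NN"); // mirrors the DataNode log line
  }

  public static void main(String[] args) throws InterruptedException {
    Thread xceiver = new Thread(CoarseLockPileUp::slowLookup, "DataXceiver");
    Thread actor = new Thread(CoarseLockPileUp::processCommandFromNN, "BPServiceActor");
    xceiver.start();
    Thread.sleep(100);  // let the xceiver grab the lock first
    actor.start();
    xceiver.join();
    actor.join();
  }
}
{code}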

Out of curiosity, we restarted 3 of the DNs; in the 24 hours since the restart, 
no messages like the above have been logged, although they were appearing 
fairly frequently before the restart.

I have quite a few jstacks taken at 5-second intervals for some of the DNs on 
this cluster, and many samples have a runnable thread with this line at the top 
of the stack, which seems too frequent for a simple get operation:

{code}
at org.apache.hadoop.hdfs.util.FoldedTreeSet.get(FoldedTreeSet.java:449)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaMap.get(ReplicaMap.java:111)
- locked <0x00000006b79baf00> (a 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
{code}

After the restart, the jstacks never showed this line in any sample for at 
least the next week while we were monitoring the cluster.

Another interesting point is that the jstack always shows line 449. This is the 
method that was running, with that line marked below:

{code}
  public E get(Object obj, Comparator<?> cmp) {
    Objects.requireNonNull(obj);

    Node<E> node = root;
    while (node != null) {
      E[] entries = node.entries;

      int leftIndex = node.leftIndex;
      int result = compare(obj, entries[leftIndex], cmp);
      if (result < 0) {
        node = node.left;
      } else if (result == 0) {
        return entries[leftIndex];
      } else {
        int rightIndex = node.rightIndex;
        if (leftIndex != rightIndex) {
          result = compare(obj, entries[rightIndex], cmp);
        }
        if (result == 0) {
          return entries[rightIndex];
        } else if (result > 0) {
          node = node.right;
        } else {
          int low = leftIndex + 1;
          int high = rightIndex - 1;
          while (low <= high) {
            int mid = (low + high) >>> 1;
            result = compare(obj, entries[mid], cmp);
            if (result > 0) {
              low = mid + 1;
            } else if (result < 0) {
              high = mid - 1;
            } else {
              return entries[mid];
            }
          }
          return null;
        }
      }
    } // *** This is line 449, which the jstack always has at the top of the stack.
    return null;
  }
{code}

I cannot explain why that line is always the "running line", unless it is 
somehow related to safepoints in the JVM.

Based on these two examples and a few other similar issues I have encountered 
in the namenode, I believe we have an issue with FoldedTreeSet where it somehow 
degrades over time, to the point that a simple get operation takes a relatively 
long time, and it does not seem to recover without a process restart.
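
One way to probe this hypothesis offline might be a churn micro-benchmark 
against FoldedTreeSet itself, timing batches of get() calls as elements are 
continually removed and added. The sketch below is only illustrative: it 
assumes the Comparator-taking constructor, the standard Set add/remove/size 
methods, and the get(Object, Comparator) overload shown above, and the 
class/identifier names are made up.

{code}
import java.util.Comparator;
import java.util.Random;

import org.apache.hadoop.hdfs.util.FoldedTreeSet;

// Hypothetical probe: populate a FoldedTreeSet, then alternate churn rounds
// (remove + add) with timed batches of get() calls to see whether lookup
// latency grows as the structure ages.
public class FoldedTreeSetChurnProbe {
  public static void main(String[] args) {
    Comparator<Long> cmp = Long::compare;
    FoldedTreeSet<Long> set = new FoldedTreeSet<>(cmp);
    Random rand = new Random(42);
    long nextId = 0;

    // Initial population, loosely standing in for a large block/replica set.
    for (int i = 0; i < 1_000_000; i++) {
      set.add(nextId++);
    }

    for (int round = 1; round <= 50; round++) {
      // Churn: remove a random existing-ish id, add a fresh one.
      for (int i = 0; i < 100_000; i++) {
        set.remove((long) rand.nextInt((int) nextId)); // removals may miss; that is fine
        set.add(nextId++);
      }
      // Time a batch of gets after this round of churn.
      long start = System.nanoTime();
      for (int i = 0; i < 100_000; i++) {
        set.get((long) rand.nextInt((int) nextId), cmp);
      }
      long micros = (System.nanoTime() - start) / 1_000;
      System.out.println("round " + round + ": 100k gets took " + micros
          + " us, size=" + set.size());
    }
  }
}
{code}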

HDFS-13671 talks about reverting FoldedTreeSet (at least in the Namenode) for 
performance reasons. However, I wonder whether the problems people have seen 
with it are related to this degradation over time rather than to its initial 
performance.


> FoldedTreeSet appears to degrade over time
> ------------------------------------------
>
>                 Key: HDFS-15131
>                 URL: https://issues.apache.org/jira/browse/HDFS-15131
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Priority: Major
>


