[jira] [Commented] (HDFS-13768) Adding replicas to volume map makes DataNode start slowly

Surendra Singh Lilhore (JIRA) Tue, 18 Sep 2018 01:08:16 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618651#comment-16618651
 ]


Surendra Singh Lilhore commented on HDFS-13768:
-----------------------------------------------

[~linyiqun] and [~arpitagarwal], I found a few more points that can be improved.


 !screenshot-1.png!

As the screenshot shows, {{FsDatasetUtil.getGenerationStampFromFile()}} and 
{{lock()}} take a large share of the time. Both can be improved.

1. {{FsDatasetUtil.getGenerationStampFromFile()}}: this method loops over all 
the files in the directory to find the meta file, which is time consuming.
 We can sort the list of files instead. In the sorted list the meta file 
normally sits right after its block file, so in the common case there is no 
need to loop to find it.
{code:java}
  /**
   * Find the meta-file for the specified block file
   * and then return the generation stamp from the name of the meta-file.
   */
  static long getGenerationStampFromFile(List<File> files, File blockFile,
      int index) throws IOException {
    String blockName = blockFile.getName();
    if ((index + 1) < files.size()) {
      // Check if next index file is meta file
      String metaFile = files.get(index + 1).getName();
      if (metaFile.startsWith(blockName)) {
        return Block.getGenerationStamp(metaFile);
      }
    }
    // Otherwise fall back to searching the whole list for the meta file
    for (int j = 0; j < files.size(); j++) {
      ......
      ......
      return Block.getGenerationStamp(files.get(j).getName());
    }
    FsDatasetImpl.LOG.warn("Block " + blockFile + " does not have a metafile!");
    return HdfsConstants.GRANDFATHER_GENERATION_STAMP;
  }
{code}

After this change, *DataNode startup time was reduced by 60%* in my cluster.
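
To illustrate the sorted-listing idea without any Hadoop dependency, here is a minimal sketch. The class {{MetaFileLookup}}, its helper methods, the {{blk_<id>_<genstamp>.meta}} name parsing and the fallback constant are all hypothetical stand-ins, not the code in the patch. It also shows why a fallback search is still useful: with plain lexicographic sorting, {{blk_1234}} sorts between {{blk_123}} and {{blk_123_1001.meta}}, so the adjacent entry is not always the meta file.

```java
import java.util.Arrays;

public class MetaFileLookup {
    // Hypothetical stand-in for the "no meta file found" generation stamp.
    static final long NO_GENERATION_STAMP = 0L;

    // Assumed naming scheme: the meta file of "blk_<id>" is "blk_<id>_<genstamp>.meta".
    static boolean isMetaFor(String name, String blockName) {
        return name.startsWith(blockName + "_") && name.endsWith(".meta");
    }

    // Extract <genstamp> from "blk_<id>_<genstamp>.meta".
    static long parseGenerationStamp(String metaName, String blockName) {
        int start = blockName.length() + 1;
        int end = metaName.length() - ".meta".length();
        return Long.parseLong(metaName.substring(start, end));
    }

    // sortedNames must be sorted; index is the position of the block file.
    public static long generationStamp(String[] sortedNames, int index) {
        String blockName = sortedNames[index];
        // Fast path: the meta file is usually the next entry in the sorted list.
        if (index + 1 < sortedNames.length
                && isMetaFor(sortedNames[index + 1], blockName)) {
            return parseGenerationStamp(sortedNames[index + 1], blockName);
        }
        // Slow path: scan the whole list, as the unsorted approach would.
        for (String name : sortedNames) {
            if (isMetaFor(name, blockName)) {
                return parseGenerationStamp(name, blockName);
            }
        }
        return NO_GENERATION_STAMP;
    }

    public static void main(String[] args) {
        String[] names = {
            "blk_123_1001.meta", "blk_1234", "blk_123", "blk_1234_1002.meta"
        };
        Arrays.sort(names);
        for (int i = 0; i < names.length; i++) {
            if (!names[i].endsWith(".meta")) {
                System.out.println(names[i] + " -> " + generationStamp(names, i));
            }
        }
    }
}
```

In this sketch the fast path handles {{blk_1234}}, while {{blk_123}} falls through to the scan. How often the slow path is hit in a real block directory depends on the block IDs stored there.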

2. {{BlockPoolSlice.addReplicaToReplicasMap()}}: this method first looks up 
the old replica in {{ReplicaMap}} and, if it is null, adds the ReplicaInfo to 
{{ReplicaMap}}. To do this it has to acquire the lock twice. It would be 
better to add one method to ReplicaMap that does both steps, so the work is 
done under a single lock acquisition.

*Example:*
{code:java}
  /**
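   * Add a replica's meta information into the map if no replica with the
   * same block ID exists yet; otherwise leave the map unchanged.
   *
   * @param bpid block pool id
   * @param replicaInfo a replica's meta information
   * @return the existing replica if one was already in the map, otherwise
   *         the newly added replicaInfo
   */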
  ReplicaInfo addAndGet(String bpid, ReplicaInfo replicaInfo) {
    checkBlockPool(bpid);
    checkBlock(replicaInfo);
    try (AutoCloseableLock l = lock.acquire()) {
      FoldedTreeSet<ReplicaInfo> set = map.get(bpid);
      if (set == null) {
        // Add an entry for block pool if it does not exist already
        set = new FoldedTreeSet<>();
        map.put(bpid, set);
      }
      ReplicaInfo oldReplicaInfo = set.get(replicaInfo.getBlockId(),
          LONG_AND_BLOCK_COMPARATOR);
      if (oldReplicaInfo != null) {
        return oldReplicaInfo;
      } else {
        set.add(replicaInfo);
      }
      return replicaInfo;
    }
  }
{code}
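
The same check-then-add pattern can be tried outside Hadoop with standard collections. The sketch below is only an illustration under assumptions: {{SimpleReplicaMap}} is a hypothetical class, replicas are represented as plain strings, and {{HashMap}} with {{ReentrantLock}} stands in for {{FoldedTreeSet}} and {{AutoCloseableLock}}.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class SimpleReplicaMap {
    private final ReentrantLock lock = new ReentrantLock();
    // block pool id -> (block id -> replica description)
    private final Map<String, Map<Long, String>> map = new HashMap<>();

    /**
     * Adds the replica if the block id is not present yet and returns the
     * replica that ends up in the map: the existing one if there was one,
     * otherwise the one just added. The lock is acquired only once.
     */
    public String addAndGet(String bpid, long blockId, String replica) {
        lock.lock();
        try {
            // Create the per-block-pool map on first use.
            Map<Long, String> replicas =
                map.computeIfAbsent(bpid, k -> new HashMap<>());
            // putIfAbsent returns the previous value, or null if it added ours.
            String old = replicas.putIfAbsent(blockId, replica);
            return old != null ? old : replica;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        SimpleReplicaMap replicaMap = new SimpleReplicaMap();
        System.out.println(replicaMap.addAndGet("bp-1", 1L, "finalized"));
        System.out.println(replicaMap.addAndGet("bp-1", 1L, "rbw"));
    }
}
```

A caller can tell whether its replica was added by comparing the returned value with the one it passed in, which removes the need for a separate lookup under a second lock acquisition.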

If you both agree with this change, I will include it in the next patch.

>  Adding replicas to volume map makes DataNode start slowly 
> -----------------------------------------------------------
>
>                 Key: HDFS-13768
>                 URL: https://issues.apache.org/jira/browse/HDFS-13768
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Yiqun Lin
>            Assignee: Surendra Singh Lilhore
>            Priority: Major
>         Attachments: HDFS-13768.01.patch, HDFS-13768.02.patch, 
> HDFS-13768.patch, screenshot-1.png
>
>
> We found that DNs start very slowly during a rolling upgrade of our cluster. 
> When we restart DNs, they start slowly and do not register with the NN 
> immediately. This causes a lot of errors like the following:
> {noformat}
> DataXceiver error processing WRITE_BLOCK operation  src: /xx.xx.xx.xx:64360 
> dst: /xx.xx.xx.xx:50010
> java.io.IOException: Not ready to serve the block pool, 
> BP-1508644862-xx.xx.xx.xx-1493781183457.
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAndWaitForBP(DataXceiver.java:1290)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAccess(DataXceiver.java:1298)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:630)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Looking into the DN startup logic, the DN initializes the block pool before 
> registration. During block pool initialization, we found that adding 
> replicas to the volume map is the most expensive operation. Related log:
> {noformat}
> 2018-07-26 10:46:23,771 INFO [Thread-105] 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to 
> add replicas to map for block pool BP-1508644862-xx.xx.xx.xx-1493781183457 on 
> volume /home/hard_disk/1/dfs/dn/current: 242722ms
> 2018-07-26 10:46:26,231 INFO [Thread-109] 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to 
> add replicas to map for block pool BP-1508644862-xx.xx.xx.xx-1493781183457 on 
> volume /home/hard_disk/5/dfs/dn/current: 245182ms
> 2018-07-26 10:46:32,146 INFO [Thread-112] 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to 
> add replicas to map for block pool BP-1508644862-xx.xx.xx.xx-1493781183457 on 
> volume /home/hard_disk/8/dfs/dn/current: 251097ms
> 2018-07-26 10:47:08,283 INFO [Thread-106] 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to 
> add replicas to map for block pool BP-1508644862-xx.xx.xx.xx-1493781183457 on 
> volume /home/hard_disk/2/dfs/dn/current: 287235ms
> {noformat}
> Currently the DN uses an independent thread to scan and add replicas for 
> each volume, but we still need to wait for the slowest thread to finish its 
> work. So the main goal here is to make these threads run faster.
> The jstack we get while the DN is blocked adding replicas:
> {noformat}
> "Thread-113" #419 daemon prio=5 os_prio=0 tid=0x00007f40879ff000 nid=0x145da 
> runnable [0x00007f4043a38000]
>    java.lang.Thread.State: RUNNABLE
>       at java.io.UnixFileSystem.list(Native Method)
>       at java.io.File.list(File.java:1122)
>       at java.io.File.listFiles(File.java:1207)
>       at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1165)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:445)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.getVolumeMap(BlockPoolSlice.java:342)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getVolumeMap(FsVolumeImpl.java:864)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$1.run(FsVolumeList.java:191)
> {noformat}
> One possible improvement is to use a ForkJoinPool for this recursive task 
> rather than doing it synchronously. This would be a great improvement 
> because it can greatly speed up the recovery process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

