[ https://issues.apache.org/jira/browse/HDFS-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16828026#comment-16828026 ]

Stephen O'Donnell commented on HDFS-13677:
------------------------------------------

That failing test seems to be flaky: running it locally, it failed
intermittently even without the patch in place.

Borrowing a lot from the existing tests in TestDataNodeHotSwapVolumes, here is 
a test that reproduces the issue without the patch and passes with the patch in 
place:

{code}
  /**
   * Test re-adding one volume with some blocks on a running MiniDFSCluster
   * with only one NameNode to reproduce HDFS-13677.
   */
  @Test(timeout=60000)
  public void testReAddVolumeWithBlocks()
      throws IOException, ReconfigurationException,
      InterruptedException, TimeoutException {
    startDFSCluster(1, 1);
    String bpid = cluster.getNamesystem().getBlockPoolId();
    final int numBlocks = 10;

    Path testFile = new Path("/test");
    createFile(testFile, numBlocks);

    List<Map<DatanodeStorage, BlockListAsLongs>> blockReports =
        cluster.getAllBlockReports(bpid);
    assertEquals(1, blockReports.size());  // 1 DataNode
    assertEquals(2, blockReports.get(0).size());  // 2 volumes

    // Now remove the second volume
    DataNode dn = cluster.getDataNodes().get(0);
    Collection<String> oldDirs = getDataDirs(dn);
    String newDirs = oldDirs.iterator().next();  // Keep the first volume.
    assertThat(
        "DN did not update its own config",
        dn.reconfigurePropertyImpl(
            DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, newDirs),
        is(dn.getConf().get(DFS_DATANODE_DATA_DIR_KEY)));
    assertFileLocksReleased(
      new ArrayList<String>(oldDirs).subList(1, oldDirs.size()));

    // Now create another file. The remaining volume should hold 15 blocks
    // (5 original + 10 new); the removed volume still holds its original 5.
    createFile(new Path("/test2"), numBlocks);
    dn.scheduleAllBlockReport(0);
    blockReports = cluster.getAllBlockReports(bpid);

    assertEquals(1, blockReports.size());  // 1 DataNode
    assertEquals(1, blockReports.get(0).size());  // 1 volume
    for (BlockListAsLongs blockList : blockReports.get(0).values()) {
      assertEquals(15, blockList.getNumberOfBlocks());
    }

    // Now add the original volume back again and ensure 15 blocks are reported
    assertThat(
        "DN did not update its own config",
        dn.reconfigurePropertyImpl(
            DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, String.join(",", oldDirs)),
        is(dn.getConf().get(DFS_DATANODE_DATA_DIR_KEY)));
    dn.scheduleAllBlockReport(0);
    blockReports = cluster.getAllBlockReports(bpid);

    assertEquals(1, blockReports.size());  // 1 DataNode
    assertEquals(2, blockReports.get(0).size());  // 2 volumes

    // The order of the block reports is not guaranteed. As we expect 2, get the
    // max block count and the min block count and then assert on that.
    int minNumBlocks = Integer.MAX_VALUE;
    int maxNumBlocks = Integer.MIN_VALUE;
    for (BlockListAsLongs blockList : blockReports.get(0).values()) {
      minNumBlocks = Math.min(minNumBlocks, blockList.getNumberOfBlocks());
      maxNumBlocks = Math.max(maxNumBlocks, blockList.getNumberOfBlocks());
    }
    assertEquals(5, minNumBlocks);
    assertEquals(15, maxNumBlocks);
  }
{code}

Without the patch, the second-to-last assertEquals fails: the volume that was 
never removed reports zero blocks instead of 15, because its volumeMap entries 
are lost when the other volume is re-added.

Feel free to use the above test in the patch or refactor it as needed.
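
The failure mode the reporter describes (the volumeMap being overwritten via
ReplicaMap.addAll when a volume is re-added) can be sketched with a minimal,
self-contained model. This is not the real HDFS ReplicaMap; the class and
method names below are hypothetical stand-ins used only to illustrate why a
wholesale inner-map replacement loses the existing volume's blocks while a
per-block-pool merge keeps them:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical, simplified model of the suspected bug: a volume map keyed by
 * block pool id, where merging in a re-added volume's map replaces the
 * existing per-pool inner map instead of merging into it.
 */
public class VolumeMapOverwriteSketch {
  // bpid -> (blockId -> volume path); a stand-in for the real volumeMap
  final Map<String, Map<Long, String>> map = new HashMap<>();

  void add(String bpid, long blockId, String volume) {
    map.computeIfAbsent(bpid, k -> new HashMap<>()).put(blockId, volume);
  }

  // Buggy merge: putAll replaces the whole inner map for a bpid, so blocks
  // already tracked on the retained volume disappear.
  void addAllBuggy(VolumeMapOverwriteSketch other) {
    map.putAll(other.map);
  }

  // Fixed merge: merge per block pool, so existing entries survive.
  void addAllFixed(VolumeMapOverwriteSketch other) {
    for (Map.Entry<String, Map<Long, String>> e : other.map.entrySet()) {
      map.computeIfAbsent(e.getKey(), k -> new HashMap<>())
         .putAll(e.getValue());
    }
  }

  int size(String bpid) {
    Map<Long, String> m = map.get(bpid);
    return m == null ? 0 : m.size();
  }

  public static void main(String[] args) {
    // The re-added volume brings back its 5 old blocks.
    VolumeMapOverwriteSketch readded = new VolumeMapOverwriteSketch();
    for (long b = 100; b < 105; b++) {
      readded.add("bp1", b, "/data/vol2");
    }

    // 15 blocks already tracked on the retained volume, then the buggy merge.
    VolumeMapOverwriteSketch buggy = new VolumeMapOverwriteSketch();
    for (long b = 0; b < 15; b++) {
      buggy.add("bp1", b, "/data/vol1");
    }
    buggy.addAllBuggy(readded);
    System.out.println("buggy merge: " + buggy.size("bp1"));  // 5 (15 lost)

    // Same starting state, but merged per block pool.
    VolumeMapOverwriteSketch fixed = new VolumeMapOverwriteSketch();
    for (long b = 0; b < 15; b++) {
      fixed.add("bp1", b, "/data/vol1");
    }
    fixed.addAllFixed(readded);
    System.out.println("fixed merge: " + fixed.size("bp1"));  // 20
  }
}
```

Under this model, the buggy merge leaves only the re-added volume's 5 blocks
in the pool, matching the zero-block report seen for the retained volume in
the test above.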

> Dynamic refresh Disk configuration results in overwriting VolumeMap
> -------------------------------------------------------------------
>
>                 Key: HDFS-13677
>                 URL: https://issues.apache.org/jira/browse/HDFS-13677
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: xuzq
>            Priority: Major
>         Attachments: HDFS-13677-001.patch, image-2018-06-14-13-05-54-354.png, 
> image-2018-06-14-13-10-24-032.png
>
>
> When I added a new disk by dynamically refreshing the configuration, it 
> triggered a "FileNotFound while finding block" exception.
>  
> The steps are as follows:
> 1. Change the DataNode's hdfs-site.xml to add a new disk.
> 2. Refresh the configuration with "./bin/hdfs dfsadmin -reconfig datanode 
> ****:50020 start"
>  
> The error is like:
> ```
> VolumeScannerThread(/media/disk5/hdfs/dn): FileNotFound while finding block 
> BP-233501496-*.*.*.*-1514185698256:blk_1620868560_547245090 on volume 
> /media/disk5/hdfs/dn
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not 
> found for BP-1997955181-*.*.*.*-1514186468560:blk_1090885868_17145082
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:471)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:240)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:553)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:254)
>  at java.lang.Thread.run(Thread.java:748)
> ```
> I added some logs for confirmation, as follows:
> Log Code like:
> !image-2018-06-14-13-05-54-354.png!
> And the result is like:
> !image-2018-06-14-13-10-24-032.png!  
> The size of the 'VolumeMap' has been reduced: we found that the 'VolumeMap' 
> is overwritten with the new disk's blocks by the method 
> 'ReplicaMap.addAll(ReplicaMap other)'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
