[
https://issues.apache.org/jira/browse/HDFS-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548459#comment-16548459
]
Wei-Chiu Chuang commented on HDFS-8533:
---------------------------------------
I've been trying to reproduce this bug following the description in this jira,
without success.
Here's my test code for future reference:
{code:java}
/** Check if nn.getCorruptFiles() returns a file that has corrupted blocks. */
@Test (timeout=300000)
public void testListCorruptFilesCorruptedBlock2() throws Exception {
  MiniDFSCluster cluster = null;
  Random random = new Random();
  try {
    Configuration conf = new HdfsConfiguration();
    // datanode scans directories
    conf.setInt(DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_INTERVAL_KEY, 1);
    // datanode sends block reports
    conf.setInt(DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_KEY, 3 * 1000);
    // Set short retry timeouts so this test runs faster
    conf.setInt(DFSConfigKeys.DFS_CLIENT_RETRY_WINDOW_BASE, 10);
    // start 2 DNs
    cluster = new MiniDFSCluster.Builder(conf).numDataNodes(2).build();
    FileSystem fs = cluster.getFileSystem();
    // create two files with one block each
    DFSTestUtil util = new DFSTestUtil.Builder()
        .setName("testCorruptFilesCorruptedBlock").setNumFiles(2)
        .setMaxLevels(1).setMaxSize(512).build();
    util.createFiles(fs, "/srcdat10");
    // Now deliberately corrupt one block
    String bpid = cluster.getNamesystem().getBlockPoolId();
    File storageDir = cluster.getInstanceStorageDir(0, 1);
    File data_dir = MiniDFSCluster.getFinalizedDir(storageDir, bpid);
    assertTrue("data directory does not exist", data_dir.exists());
    List<File> metaFiles = MiniDFSCluster.getAllBlockFiles(data_dir);
    assertTrue("Data directory does not contain any blocks or there was an "
        + "IO error", metaFiles != null && !metaFiles.isEmpty());
    File metaFile = metaFiles.get(0);
    RandomAccessFile file = new RandomAccessFile(metaFile, "rw");
    FileChannel channel = file.getChannel();
    long position = channel.size() - 2;
    int length = 2;
    byte[] buffer = new byte[length];
    random.nextBytes(buffer);
    channel.write(ByteBuffer.wrap(buffer), position);
    file.close();
    LOG.info("Deliberately corrupting file " + metaFile.getName() +
        " at offset " + position + " length " + length);
    // read all files to trigger detection of corrupted replica
    try {
      util.checkFiles(fs, "/srcdat10");
    } catch (BlockMissingException e) {
      System.out.println("Received BlockMissingException as expected.");
    } catch (IOException e) {
      assertTrue("Corrupted replicas not handled properly. " +
          "Expecting BlockMissingException but received IOException " + e,
          false);
    }
    LOG.info("Restarting Datanode to trigger BlockPoolSliceScanner");
    cluster.restartDataNodes();
    cluster.waitActive();
    cluster.stopDataNode(1);
    // fetch bad file list from namenode. There should be one file.
    final NameNode namenode = cluster.getNameNode();
    while (cluster.getNamesystem().getBlockManager().getMissingBlocksCount()
        == 0) {
      Thread.sleep(1000);
      LOG.info("Still waiting for missing block");
    }
    assertEquals(1,
        cluster.getNamesystem().getBlockManager().getMissingBlocksCount());
    Collection<FSNamesystem.CorruptFileBlockInfo> badFiles =
        namenode.getNamesystem().listCorruptFileBlocks("/", null);
    LOG.info("Namenode has bad files. " + badFiles.size());
    assertTrue("Namenode has " + badFiles.size() + " bad files. Expecting 1.",
        badFiles.size() == 1);
    util.cleanup(fs, "/srcdat10");
  } finally {
    if (cluster != null) { cluster.shutdown(); }
  }
}
{code}
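The counting discrepancy described in the issue below can be modeled with a small, self-contained sketch. This is a simplified, hypothetical model of the two counting rules, not actual HDFS code; the class and method names are invented for illustration, and the assumption that fsck's totalReplicasPerBlock excludes corrupt replicas is taken from the quoted snippet:
{code:java}
// Hypothetical, simplified model of the two missing-block counting rules
// described in HDFS-8533. Names are invented for illustration; this is not
// actual HDFS code.
public class MissingBlockCountSketch {

  /** State of one block: healthy (live) replicas and corrupt replicas. */
  static final class BlockState {
    final int liveReplicas;
    final int corruptReplicas;
    BlockState(int liveReplicas, int corruptReplicas) {
      this.liveReplicas = liveReplicas;
      this.corruptReplicas = corruptReplicas;
    }
  }

  /**
   * fsck rule (per the quoted snippet): a block is missing only when no
   * replicas exist at all AND none of them are merely corrupt. Assumption:
   * totalReplicasPerBlock here counts only live replicas.
   */
  static boolean fsckCountsAsMissing(BlockState b) {
    int totalReplicasPerBlock = b.liveReplicas;
    boolean isCorrupt = b.corruptReplicas > 0;
    return totalReplicasPerBlock == 0 && !isCorrupt;
  }

  /**
   * Metrics rule (JMX, "dfsadmin -report", UI, logs, as described in the
   * issue): a block is missing as soon as no live replica remains, even if
   * corrupt replicas still exist.
   */
  static boolean metricsCountAsMissing(BlockState b) {
    return b.liveReplicas == 0;
  }

  public static void main(String[] args) {
    // Scenario from the issue: replication=3 on a 2-DN cluster, one replica
    // corrupted on DN1, DN2 down -> 0 live replicas, 1 corrupt replica.
    BlockState block = new BlockState(0, 1);
    System.out.println("fsck missing:    " + fsckCountsAsMissing(block));
    System.out.println("metrics missing: " + metricsCountAsMissing(block));
  }
}
{code}
Under this model the scenario yields fsck missing = false (count 0) but metrics missing = true (count 1), which matches the mismatch reported below.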
> Mismatch in displaying the "MissingBlock" count in fsck and in other metric
> reports
> -----------------------------------------------------------------------------------
>
> Key: HDFS-8533
> URL: https://issues.apache.org/jira/browse/HDFS-8533
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: J.Andreina
> Assignee: J.Andreina
> Priority: Critical
>
> Number of DNs = 2
> Step 1: Write a file with replication factor 3.
> Step 2: Corrupt a replica on DN1.
> Step 3: DN2 is down.
> The missing block count in the reports is as follows:
> Fsck report: *0*
> Jmx, "dfsadmin -report", UI, logs: *1*
> In fsck, only blocks whose replicas are all missing and have not been
> corrupted are counted:
> {code}
> if (totalReplicasPerBlock == 0 && !isCorrupt) {
>   // If the block is corrupted, it means all its available replicas are
>   // corrupted. We don't mark it as missing given these available replicas
>   // might still be accessible as the block might be incorrectly marked as
>   // corrupted by client machines.
> {code}
> While in the other reports, a block is considered missing even if all of its
> replicas are corrupted.
> Please provide your thoughts: can we make the missing block count consistent
> across all the reports, same as implemented for fsck?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]