[ https://issues.apache.org/jira/browse/HDFS-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396168#comment-15396168 ]

Yongjun Zhang commented on HDFS-10625:
--------------------------------------

Hi [~linyiqun] and [~shahrs87],

Sorry for the delay. I took a further look, and I think it's good to include the 
HDFS-10626 fix here and mark HDFS-10626 as a duplicate. I'd like to include 
both of you as contributors for this jira.

I looked at the latest patch here; it seems to me that the best place to fix 
this is in BlockSender:
{code}
  long sendBlock(DataOutputStream out, OutputStream baseStream, 
                 DataTransferThrottler throttler) throws IOException {
    final TraceScope scope = datanode.getTracer().
        newScope("sendBlock_" + block.getBlockId());
    try {
      return doSendBlock(out, baseStream, throttler);
    } finally {
      scope.close();
    }
  }
{code}

We can add a catch block here to catch the IOException thrown, then include the 
replica information and throw a new IOException, e.g.:
{code}
    try {
      return doSendBlock(out, baseStream, throttler);
    } catch (IOException ie) {
      // throw new IOE here with replica info
      throw new IOException(replicaInfoStr, ie);
    } finally {
      scope.close();
    }
{code}
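To make the pattern concrete, here is a minimal self-contained sketch (outside 
HDFS; the class name, {{replicaInfoStr}}, and the message texts are all 
illustrative, not the real patch): wrapping the caught IOException keeps the 
original failure as the cause while prepending the replica context:
{code}
import java.io.IOException;

public class WrapIOExample {
  // Hypothetical stand-in for BlockSender#doSendBlock that fails mid-transfer
  static long doSendBlock() throws IOException {
    throw new IOException("checksum error at offset 4096");
  }

  // Sketch of the proposed pattern: wrap the IOException with replica context
  static long sendBlock(String replicaInfoStr) throws IOException {
    try {
      return doSendBlock();
    } catch (IOException ie) {
      // New IOE carries the replica info; original exception kept as cause
      throw new IOException(replicaInfoStr, ie);
    }
  }

  public static void main(String[] args) {
    try {
      sendBlock("replica=blk_1170125248_96458336, visibleLength=134217728");
    } catch (IOException e) {
      System.out.println(e.getMessage());            // replica context
      System.out.println(e.getCause().getMessage()); // original failure
    }
  }
}
{code}
Since the original exception is chained as the cause, the full stack trace of 
the underlying failure still appears in the datanode log.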

There is a snippet in the constructor that gets the replica info:
{code}
    final Replica replica;
    final long replicaVisibleLength;
    synchronized(datanode.data) {
      replica = getReplica(block, datanode);
      replicaVisibleLength = replica.getVisibleLength();
    }
{code}
It looks like we can make this replica a member of BlockSender instead of a 
local variable here, so that we can refer to it when needed, such as for this 
jira. We should probably also make {{replicaVisibleLength}} a member and report 
it as part of the replica info, since this value may be changing concurrently 
while the replica is being written. Hi [~vinayrpet], what do you think about 
this suggestion?
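As a rough self-contained sketch of that refactoring (the {{Replica}} stub, 
class names, and the {{replicaInfoString}} helper are illustrative, not the 
real HDFS types): the two values become fields assigned once in the 
constructor, so an error-reporting helper can read them later:
{code}
public class ReplicaFieldSketch {
  // Minimal stand-in for the HDFS Replica interface (illustrative only)
  interface Replica {
    long getVisibleLength();
  }

  static class BlockSender {
    // Proposed change: members instead of constructor-local variables,
    // so they are available later when building an error message
    private final Replica replica;
    private final long replicaVisibleLength;

    BlockSender(Replica replica) {
      this.replica = replica;
      // In the real code this snapshot is taken under the dataset lock;
      // for a replica still being written, the visible length keeps growing
      this.replicaVisibleLength = replica.getVisibleLength();
    }

    // Hypothetical helper used when wrapping an IOException
    String replicaInfoString() {
      return "visibleLength=" + replicaVisibleLength;
    }
  }

  public static void main(String[] args) {
    BlockSender sender = new BlockSender(() -> 4096L);
    System.out.println(sender.replicaInfoString());
  }
}
{code}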

Thanks.


>  VolumeScanner to report why a block is found bad
> -------------------------------------------------
>
>                 Key: HDFS-10625
>                 URL: https://issues.apache.org/jira/browse/HDFS-10625
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, hdfs
>            Reporter: Yongjun Zhang
>            Assignee: Rushabh S Shah
>              Labels: supportability
>         Attachments: HDFS-10625-1.patch, HDFS-10625.patch
>
>
> VolumeScanner may report:
> {code}
> WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> blk_1170125248_96458336 on /d/dfs/dn
> {code}
> It would be helpful to report the reason why the block is bad, especially 
> when the block is corrupt, and where the first corrupted chunk in the block is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
