[jira] [Commented] (HDFS-8694) Expose the stats of IOErrors on each FsVolume through JMX

Andrew Wang (JIRA) Tue, 14 Jul 2015 10:59:37 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626770#comment-14626770
 ]


Andrew Wang commented on HDFS-8694:
-----------------------------------

Hi Eddy, overall this patch looks great, thanks for working on it. Some review 
comments:

High-level:
* I have a hard time understanding when we should call handle the disk error 
vs. just bubbling up, since it bubbles there seems like a danger of handling 
the same root IOE more than once. What's the methodology here? Is it possible 
to move handling to the top-level somewhere? I can manually examine all the 
current callsites and callers, but that's not very future-proof.
* Related to the above, our unit tests do not cover anywhere close to all the 
locations that handle an IOError. Adding tests for all of these would be 
onerous, so very interested in a solution to the above.
* Since we now have the volume as context, we should really move the disk 
checker to be per-volume rather than DN wide. One volume throwing an error is 
no reason to check all of them. This can be deferred to a follow-up; I think 
it's a slam dunk.

Nits:

* FsDatasetImpl#moveBlockAcrossStorage, can we get rid of volume and move 
targetVolume's declaration outside of the try? Looks equal to volume.
* In places like BlockSender#close we actually could IOE multiple times but 
only increment once. Thoughts?
* Extra debug print in TestDataTransferKeepalive#testSlowReader
* Linebreak here is kind of awkward, move the end parens or the try up?

{code}
    try (ReplicaHandler replica = dataset.append(
        block, block.getGenerationStamp() + 1, block.getNumBytes())
    ) {
{code}

* TestMover adds a test timeout, looks unrelated to this patch?

> Expose the stats of IOErrors on each FsVolume through JMX
> ---------------------------------------------------------
>
>                 Key: HDFS-8694
>                 URL: https://issues.apache.org/jira/browse/HDFS-8694
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, HDFS
>    Affects Versions: 2.7.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>         Attachments: HDFS-8694.000.patch, HDFS-8694.001.patch
>
>
> Currently, once DataNode hits an {{IOError}} when writing / reading block 
> files, it starts a background {{DiskChecker.checkDirs()}} thread. But if this 
> thread successfully finishes, DN does not record this {{IOError}}. 
> We need one measurement to count all {{IOErrors}} for each volume.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HDFS-8694) Expose the stats of IOErrors on each FsVolume through JMX

Reply via email to