[ 
https://issues.apache.org/jira/browse/HDFS-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628371#comment-14628371
 ] 

Lei (Eddy) Xu commented on HDFS-8694:
-------------------------------------

Thanks for the reviews, [~andrew.wang]

bq. I have a hard time understanding when we should handle the disk error vs. 
just bubbling it up. Since it bubbles, there seems to be a danger of handling 
the same root IOE more than once. What's the methodology here? Is it possible 
to move handling to the top-level somewhere? I can manually examine all the 
current callsites and callers, but that's not very future-proof.

The reason for calling {{volume#handleIOErrors()}} is that by the time the 
{{IOE}} bubbles up to the place where we used to call 
{{DataNode#checkDiskErrorAsync()}}, the context (which volume the IO was on) is 
usually missing. My intention was to call {{volume#handleIOErrors()}} at the 
highest level that manages the {{volume}} object's lifetime. I will try to get 
rid of the {{DataNode#checkDiskErrorAsync()}} call in a follow-up JIRA.
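
To illustrate the idea (this is a rough sketch, not the code in the patch): the 
error is recorded at the point where the volume is still known, and is then 
rethrown so callers see the same {{IOE}}. {{SketchVolume}}, {{VolumeIo}} and 
{{VolumeIoRunner}} are hypothetical names, and {{handleIOErrors()}} here only 
counts errors, which is an assumption about its behavior.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a volume; this is NOT the real FsVolumeSpi.
class SketchVolume {
    private final String path;
    private final AtomicLong ioErrors = new AtomicLong();

    SketchVolume(String path) {
        this.path = path;
    }

    // Assumed behavior: only count the error against this volume.
    void handleIOErrors(IOException e) {
        ioErrors.incrementAndGet();
    }

    long numOfErrors() {
        return ioErrors.get();
    }

    String getPath() {
        return path;
    }
}

// Hypothetical I/O action that may fail with an IOException.
interface VolumeIo<T> {
    T run() throws IOException;
}

class VolumeIoRunner {
    // Runs the action at the level that still knows the volume.
    // On failure, records the error on that volume, then rethrows it unchanged.
    static <T> T runOnVolume(SketchVolume volume, VolumeIo<T> io) throws IOException {
        try {
            return io.run();
        } catch (IOException e) {
            volume.handleIOErrors(e);
            throw e;
        }
    }
}
```

If every volume-scoped IO goes through a single wrapper like this, each {{IOE}} 
is counted once at that level, and the layers above only propagate it.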

bq. Since we now have the volume as context, we should really move the disk 
checker to be per-volume rather than DN wide. One volume throwing an error is 
no reason to check all of them. This can be deferred to a follow-up; I think 
it's a slam dunk.

Yes. That is the reason for putting {{handleIOErrors()}} into {{FsVolumeSpi}}. I 
was thinking of using a per-volume thread to run {{checkDirs()}}, with 
{{numOfErrors()}} as the trigger. I will do that in a follow-up JIRA as well.
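
A minimal sketch of what such a per-volume trigger might look like, under 
assumptions that are not in the patch: {{VolumeCheckTrigger}} is a hypothetical 
class, the check is passed in as a {{Runnable}} standing in for {{checkDirs()}}, 
and the policy of checking on every Nth error is just one possible choice.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-volume trigger; one instance would exist per volume.
class VolumeCheckTrigger {
    private final AtomicLong errors = new AtomicLong();
    private final AtomicBoolean checkPending = new AtomicBoolean(false);
    private final long threshold;
    private final Executor executor;   // e.g. a single-thread executor per volume
    private final Runnable checkDirs;  // stand-in for the volume's disk check

    VolumeCheckTrigger(long threshold, Executor executor, Runnable checkDirs) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.threshold = threshold;
        this.executor = executor;
        this.checkDirs = checkDirs;
    }

    // Called for each IO error on this volume.
    // Assumed policy: schedule a check on every threshold-th error,
    // unless a check for this volume is already pending.
    void onIOError() {
        long n = errors.incrementAndGet();
        if (n % threshold == 0 && checkPending.compareAndSet(false, true)) {
            executor.execute(() -> {
                try {
                    checkDirs.run();
                } finally {
                    checkPending.set(false);
                }
            });
        }
    }

    long numOfErrors() {
        return errors.get();
    }
}
```

With this shape, an error on one volume only ever schedules a check of that 
volume, and the other volumes are left alone.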

Working on the rest of the comments.

Thanks a lot for these great comments.

> Expose the stats of IOErrors on each FsVolume through JMX
> ---------------------------------------------------------
>
>                 Key: HDFS-8694
>                 URL: https://issues.apache.org/jira/browse/HDFS-8694
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, HDFS
>    Affects Versions: 2.7.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>         Attachments: HDFS-8694.000.patch, HDFS-8694.001.patch
>
>
> Currently, once the DataNode hits an {{IOError}} when reading or writing block 
> files, it starts a background {{DiskChecker.checkDirs()}} thread. But if this 
> thread finishes successfully, the DN does not record the {{IOError}}. 
> We need a metric that counts all {{IOErrors}} for each volume.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
