[
https://issues.apache.org/jira/browse/HDFS-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628371#comment-14628371
]
Lei (Eddy) Xu commented on HDFS-8694:
-------------------------------------
Thanks for the reviews, [~andrew.wang]
bq. I have a hard time understanding when we should handle the disk error
vs. just bubbling it up. Since it bubbles, there seems to be a danger of handling
the same root IOE more than once. What's the methodology here? Is it possible
to move handling to the top-level somewhere? I can manually examine all the
current callsites and callers, but that's not very future-proof.
The reason for calling {{volume#handleIOErrors()}} directly is that by the time
the {{IOE}} propagates up to the place where we used to call
{{DataNode#checkDiskErrorAsync()}}, the context (which volume the I/O was on) is
usually missing. My intention was to call {{volume#handleIOErrors()}} at the
highest level that manages the {{volume}} object's lifetime. I will try to remove
the {{DataNode#checkDiskErrorAsync()}} call in a follow-up JIRA.
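
To illustrate the idea, here is a minimal sketch and not the code in the patch. {{SketchVolume}}, {{VolumeIO}} and {{VolumeIOSketch#withVolume()}} are hypothetical names made up for this example, and the real {{FsVolumeSpi}} interface may look different. It assumes the error is recorded exactly once, at the level where the volume is still known, and then rethrown so that callers above only propagate it.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a volume; not the real FsVolumeSpi. */
class SketchVolume {
  private final String path;
  private final AtomicLong ioErrors = new AtomicLong();

  SketchVolume(String path) {
    this.path = path;
  }

  /** Records one I/O error against this volume. */
  void handleIOError(IOException e) {
    ioErrors.incrementAndGet();
  }

  long getNumIOErrors() {
    return ioErrors.get();
  }

  String getPath() {
    return path;
  }
}

/** Hypothetical I/O operation that may fail with an IOException. */
interface VolumeIO<T> {
  T run() throws IOException;
}

class VolumeIOSketch {
  /**
   * Runs an operation at the level where the volume is still known.
   * A failure is recorded once against that volume and then rethrown;
   * callers higher up are assumed to propagate it without recording again.
   */
  static <T> T withVolume(SketchVolume volume, VolumeIO<T> op) throws IOException {
    try {
      return op.run();
    } catch (IOException e) {
      volume.handleIOError(e);
      throw e;
    }
  }
}
```

With this shape, the double-handling concern reduces to making sure that only one layer wraps the I/O in {{withVolume()}}.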
bq. Since we now have the volume as context, we should really move the disk
checker to be per-volume rather than DN wide. One volume throwing an error is
no reason to check all of them. This can be deferred to a follow-up; I think
it's a slam dunk.
Yes, that is the reason for putting {{handleIOErrors()}} into {{FsVolumeSpi}}. I
was thinking of using a per-volume thread to run {{checkDirs()}}, with
{{numOfErrors()}} as the trigger. I will do that in a follow-up JIRA as well.
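
A rough sketch of what a per-volume trigger could look like, under assumptions that are mine and not settled in this JIRA: {{PerVolumeCheckerSketch}} is a hypothetical class, the check fires on every {{threshold}}-th error, and the actual directory check is passed in as a plain {{Runnable}} rather than calling the real {{DiskChecker}}.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-volume checker; not part of Hadoop. */
class PerVolumeCheckerSketch {
  private final AtomicLong numErrors = new AtomicLong();
  private final AtomicBoolean checkInFlight = new AtomicBoolean();
  private final long threshold;
  private final Executor executor;
  private final Runnable checkDirs;

  PerVolumeCheckerSketch(long threshold, Executor executor, Runnable checkDirs) {
    if (threshold <= 0) {
      throw new IllegalArgumentException("threshold must be positive");
    }
    this.threshold = threshold;
    this.executor = executor;
    this.checkDirs = checkDirs;
  }

  /**
   * Counts one I/O error for this volume. On every threshold-th error,
   * a check of this volume only is scheduled, unless one is already running.
   */
  void onIOError() {
    long n = numErrors.incrementAndGet();
    if (n % threshold == 0 && checkInFlight.compareAndSet(false, true)) {
      executor.execute(() -> {
        try {
          checkDirs.run();
        } finally {
          // Allow the next check to be scheduled once this one is done.
          checkInFlight.set(false);
        }
      });
    }
  }

  long getNumErrors() {
    return numErrors.get();
  }
}
```

Each volume would own one such instance, so an error on one volume does not trigger checks on the others. The executor could be a single-threaded one per volume, as mentioned above.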
Working on the rest of the comments.
Thanks a lot for these great comments.
> Expose the stats of IOErrors on each FsVolume through JMX
> ---------------------------------------------------------
>
> Key: HDFS-8694
> URL: https://issues.apache.org/jira/browse/HDFS-8694
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, HDFS
> Affects Versions: 2.7.0
> Reporter: Lei (Eddy) Xu
> Assignee: Lei (Eddy) Xu
> Attachments: HDFS-8694.000.patch, HDFS-8694.001.patch
>
>
> Currently, once the DataNode hits an {{IOError}} when writing / reading block
> files, it starts a background {{DiskChecker.checkDirs()}} thread. But if this
> thread finishes successfully, the DN does not record this {{IOError}}.
> We need a metric that counts all {{IOErrors}} for each volume.