[
https://issues.apache.org/jira/browse/HDFS-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626770#comment-14626770
]
Andrew Wang commented on HDFS-8694:
-----------------------------------
Hi Eddy, overall this patch looks great, thanks for working on it. Some review
comments:
High-level:
* I have a hard time understanding when we should call handle the disk error
vs. just bubbling up, since it bubbles there seems like a danger of handling
the same root IOE more than once. What's the methodology here? Is it possible
to move handling to the top-level somewhere? I can manually examine all the
current callsites and callers, but that's not very future-proof.
* Related to the above, our unit tests do not cover anywhere close to all the
locations that handle an IOError. Adding tests for all of these would be
onerous, so very interested in a solution to the above.
* Since we now have the volume as context, we should really move the disk
checker to be per-volume rather than DN wide. One volume throwing an error is
no reason to check all of them. This can be deferred to a follow-up; I think
it's a slam dunk.
Nits:
* FsDatasetImpl#moveBlockAcrossStorage, can we get rid of volume and move
targetVolume's declaration outside of the try? Looks equal to volume.
* In places like BlockSender#close we actually could IOE multiple times but
only increment once. Thoughts?
* Extra debug print in TestDataTransferKeepalive#testSlowReader
* Linebreak here is kind of awkward, move the end parens or the try up?
{code}
try (ReplicaHandler replica = dataset.append(
block, block.getGenerationStamp() + 1, block.getNumBytes())
) {
{code}
* TestMover adds a test timeout, looks unrelated to this patch?
> Expose the stats of IOErrors on each FsVolume through JMX
> ---------------------------------------------------------
>
> Key: HDFS-8694
> URL: https://issues.apache.org/jira/browse/HDFS-8694
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, HDFS
> Affects Versions: 2.7.0
> Reporter: Lei (Eddy) Xu
> Assignee: Lei (Eddy) Xu
> Attachments: HDFS-8694.000.patch, HDFS-8694.001.patch
>
>
> Currently, once DataNode hits an {{IOError}} when writing / reading block
> files, it starts a background {{DiskChecker.checkDirs()}} thread. But if this
> thread successfully finishes, DN does not record this {{IOError}}.
> We need one measurement to count all {{IOErrors}} for each volume.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)