[
https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991104#comment-14991104
]
Bob Hansen commented on HDFS-9103:
----------------------------------
I think the design is sound. We could create an interface and implementation
with mock objects for testing, but I don't think that's necessary in this case.
Include an abstraction (either by template or base class) for the clock to make
time testable.
Can we re-use the mock objects in our tests rather than re-declaring them?
{{bad_datanode_test.cc}} defines FileHandleTest. We should have the test name
match the filename.
The recoverable-error test should be calling ASSERT_FALSE for clarity.
I would like to see the {{FileHandle::Pread}} method implement the retry logic
internally so we have a simple "read all this data or completely fail" method,
rather than forcing partial reads and retries onto our consumer. The logic that
_these_ errors mean you should retry, but _this_ error means you shouldn't,
could be abstracted away as a kindness to the consumer.
Need tests for:
* BadDataNodeTracker as a testable unit. Hit all of its cases; it's pretty
small, so it should be easy. In particular, we currently have no test
coverage for {{RemoveAllExpired}}.
* Doing a retry-and-recover
While I'm all about cleaning up formatting as we go, this is going to cause a
lot of conflicts with HDFS-9144 with no semantic difference, and it risks
reviewers missing a real change because it looked like a formatting fix. Can
we punt some of these until HDFS-9144 lands? If it would be onerous to pull
out (e.g. it's not the last change in your local log), I can deal with it on
merge.
*Minor nits* that don't have to be corrected, but you might want to peek at:
In IsBadNode:
* Reset the counter so there is never a chance for overflow
* remove_counter_ seems like the wrong name; perhaps check_counter?
* We're currently traversing the map up to 3 times in count(), get(), and
erase(); perhaps one lookup and keep the iterator?
In RemoveAllExpired:
{code}datanodes_.erase(it++);{code}
can be a bit more idiomatic c++11 with
{code}it = datanodes_.erase(it);{code}
> Retry reads on DN failure
> -------------------------
>
> Key: HDFS-9103
> URL: https://issues.apache.org/jira/browse/HDFS-9103
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Reporter: Bob Hansen
> Assignee: James Clampffer
> Fix For: HDFS-8707
>
> Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.007.patch,
> HDFS-9103.HDFS-8707.3.patch, HDFS-9103.HDFS-8707.4.patch,
> HDFS-9103.HDFS-8707.5.patch
>
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)