[ 
https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991104#comment-14991104
 ] 

Bob Hansen commented on HDFS-9103:
----------------------------------

I think the design is sound.  We could create an interface and implementation 
with mock objects for testing, but I don't think that's necessary in this case. 
 

Include an abstraction (either by template or base class) for the clock to make 
time testable.

Can we re-use the mock objects in our tests rather than re-declaring them?
{{bad_datanode_test.cc}} defines FileHandleTest.  We should have the test name 
match the filename.
The recoverable-error test should call ASSERT_FALSE for clarity.

I would like to see the {{FileHandle::Pread}} method implement the retry logic 
internally so we have a simple "read all this data or completely fail" method 
rather than forcing partial read and retry onto our consumer.  Understanding 
the logic that _these_ errors mean you should retry, but _this_ error means 
that you shouldn't retry could be abstracted away as a kindness to the consumer.
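For illustration, one way that internal retry loop could look (a minimal sketch with hypothetical {{PreadAll}}, {{Status}}, and callback names; the real libhdfs++ types and signatures differ):

```cpp
#include <set>
#include <string>
#include <functional>

// Hypothetical status type; the real libhdfs++ Status differs.
struct Status {
  bool ok;
  bool recoverable;  // e.g. a DN timeout vs. an invalid-argument error
};

// "Read it all or completely fail": keep retrying with the failed DN
// excluded until the read succeeds, the error is non-recoverable, or
// we run out of replicas. The recoverable/non-recoverable distinction
// stays hidden from the consumer.
Status PreadAll(std::function<Status(const std::set<std::string>& excluded,
                                     std::string* failed_dn)> try_read,
                int max_replicas) {
  std::set<std::string> excluded;
  for (int attempt = 0; attempt < max_replicas; ++attempt) {
    std::string failed_dn;
    Status s = try_read(excluded, &failed_dn);
    if (s.ok || !s.recoverable)
      return s;                  // success, or an error retrying won't fix
    excluded.insert(failed_dn);  // recoverable: exclude the DN and retry
  }
  return Status{false, false};   // exhausted all replicas
}
```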

Need tests for:
* BadDataNodeTracker as a testable unit.  Hit all of its cases; it's pretty 
small, so it should be easy.  In particular, we currently have no test 
coverage for {{RemoveAllExpired}}. 
* Doing a retry-and-recover
  
While I'm all about cleaning up formatting as we go, this is going to cause a 
lot of conflicts with HDFS-9144 with no semantic difference.  It does risk 
reviewers missing something that changed because it looked like a formatting 
fix.  Can we punt some of these until HDFS-9144 lands?  If it will be onerous 
to pull out (e.g. not the last change you made in your local log), I can deal 
with it on merge.

*Minor nits* that don't have to be corrected, but you might want to peek at:
In IsBadNode:
*  Reset the counter so there is never a chance for overflow
*  remove_counter_ seems like the wrong name; perhaps check_counter?
*  We're currently traversing the map up to 3 times in count(), get(), and 
erase(); perhaps one lookup and keep the iterator?
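To sketch the single-lookup version (hypothetical member names; assumes the tracker keeps a DN-id-to-expiry map, which may not match the actual layout):

```cpp
#include <map>
#include <string>
#include <cstdint>

// One find() instead of count() + get() + erase(): the iterator from the
// single lookup is reused both to read the expiry and to erase the entry.
bool IsBadNode(std::map<std::string, uint64_t>& datanodes_,
               const std::string& id, uint64_t now) {
  auto it = datanodes_.find(id);   // the only map traversal
  if (it == datanodes_.end())
    return false;                  // never marked bad
  if (it->second < now) {
    datanodes_.erase(it);          // expired: drop it via the same iterator
    return false;
  }
  return true;                     // still within the bad-node window
}
```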
  
In RemoveAllExpired:
{code}datanodes_.erase(it++);{code}
can be a bit more idiomatic C++11 with 
{code}it = datanodes_.erase(it);{code}
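In context, the whole loop would then read something like this (a sketch with assumed member names, not the patch's exact code):

```cpp
#include <map>
#include <string>
#include <cstdint>

// Erase-while-iterating in C++11: map::erase returns the iterator
// following the erased element, so the loop advances safely without
// the datanodes_.erase(it++) post-increment trick.
void RemoveAllExpired(std::map<std::string, uint64_t>& datanodes_,
                      uint64_t now) {
  for (auto it = datanodes_.begin(); it != datanodes_.end(); ) {
    if (it->second < now)
      it = datanodes_.erase(it);  // advance to the next live entry
    else
      ++it;
  }
}
```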


> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: James Clampffer
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch, 
> HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.007.patch, 
> HDFS-9103.HDFS-8707.3.patch, HDFS-9103.HDFS-8707.4.patch, 
> HDFS-9103.HDFS-8707.5.patch
>
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and 
> try again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
