[ 
https://issues.apache.org/jira/browse/HDFS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268988#comment-15268988
 ] 

James Clampffer commented on HDFS-9890:
---------------------------------------

Patch looks good.  I have some comments/open questions:

Can we push "random() % RANDOM_ERROR_RATIO" into the callbacks everywhere so 
that
{code}
if (status.ok() && random() % RANDOM_ERROR_RATIO == 0 && event_resp.response() 
== event_response::kTest_Error)
{code}
turns into
{code}
if (status.ok() && event_resp.response() == event_response::kTest_Error)
{code}
So that the callback can trigger failures deterministically if needed.  It 
looks like you already took care of this in filehandle.cc
{code}
    event_response event_resp = event_handlers->call(FILE_DN_CONNECT_EVENT, 
cluster_name.c_str(), path.c_str(), 0);
 #ifndef NDEBUG
     if (event_resp.response() == event_response::kTest_Error) {
       status = event_resp.status();
{code}

BlockReader shouldn't need to hold a shared_ptr to the event_handlers_ object.  
As far as object lifetimes go FileSystem/FileHandle should always outlive block 
readers (it's a bug if they don't).  Fewer shared_ptrs means fewer bus 
locks/cache invalidations to deal with atomic integer operations; those can add 
up on NUMA systems.  I know I've added some where they aren't need 
(BlockReader::cancel_state_) that I've been trying to remove whenever I'm 
working on those bits.

Another question I have but not really related to this patch is if 
event_response needs to be a seperate class from Status.  At least in this 
usage they get effectively the same thing done but maybe there are others I'm 
not thinking about.


> libhdfs++: Add test suite to simulate network issues
> ----------------------------------------------------
>
>                 Key: HDFS-9890
>                 URL: https://issues.apache.org/jira/browse/HDFS-9890
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: Xiaowei Zhu
>         Attachments: HDFS-9890.HDFS-8707.000.patch, 
> HDFS-9890.HDFS-8707.001.patch, HDFS-9890.HDFS-8707.002.patch, 
> HDFS-9890.HDFS-8707.003.patch
>
>
> I propose adding a test suite to simulate various network issues/failures in 
> order to get good test coverage on some of the retry paths that aren't easy 
> to hit in mock unit tests.
> At the moment the only things that hit the retry paths are the gmock unit 
> tests.  The gmock are only as good as their mock implementations which do a 
> great job of simulating protocol correctness but not more complex 
> interactions.  They also can't really simulate the types of lock contention 
> and subtle memory stomps that show up while doing hundreds or thousands of 
> concurrent reads.   We should add a new minidfscluster test that focuses on 
> heavy read/seek load and then randomly convert error codes returned by 
> network functions into errors.
> List of things to simulate(while heavily loaded), roughly in order of how 
> badly I think they need to be tested at the moment:
> -Rpc connection disconnect
> -Rpc connection slowed down enough to cause a timeout and trigger retry
> -DN connection disconnect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to