[ 
https://issues.apache.org/jira/browse/HADOOP-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509416#comment-13509416
 ] 

Chris Nauroth commented on HADOOP-8980:
---------------------------------------

Hi, Xuan.  I took another look at the {{TestRPC#testErrorMsgForInsecureClient}} 
failure, and I still think it's a race condition on the server side.

Specifically, {{Server.Connection#readAndProcess}} calls 
{{Server.Connection#initializeAuthContext}} to check authentication, and if it 
fails, sets up the "authentication is not enabled" response, and enqueues the 
response by calling {{responder.doRespond(authFailedCall)}}.  The responder 
runs a separate thread that loops, dequeues responses, and writes them in 
{{Server.Responder#processResponse}}.  Meanwhile, back in the thread of 
{{Server.Connection#readAndProcess}}, the authentication failure also causes it 
to throw an {{IOException}}.  This propagates up to {{Server.Reader#doRead}}, 
which closes the connection.  If the connection happens to be closed before the 
responder thread gets a chance to write the response, then the client never 
receives the expected response message and instead fails with a connection 
abort exception.  It appears that Windows consistently schedules threads just 
right to expose this problem.
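
To make the ordering concrete, here is a minimal sketch of the interleaving I 
have in mind.  It does not use the real {{Server}} classes: 
{{FakeConnection}}, {{simulate}}, and the latches are hypothetical stand-ins, 
and the latch only exists to force the ordering that the scheduler would 
otherwise pick on its own.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

class AuthFailureRaceSketch {
    /** Hypothetical stand-in for a client connection; not Hadoop's Server.Connection. */
    static final class FakeConnection {
        private boolean closed;
        private String delivered; // what the client ends up observing

        synchronized void close() { closed = true; }

        synchronized void write(String response) {
            // Writing to an already-closed connection never reaches the client.
            delivered = closed ? "connection aborted" : response;
        }

        synchronized String delivered() { return delivered; }
    }

    /** Runs one auth failure; closeBeforeWrite forces the unlucky ordering. */
    static String simulate(boolean closeBeforeWrite) {
        FakeConnection conn = new FakeConnection();
        BlockingQueue<String> responseQueue = new LinkedBlockingQueue<>();
        CountDownLatch responderMayWrite = new CountDownLatch(1);
        CountDownLatch written = new CountDownLatch(1);

        // Plays the role of the responder thread: dequeue, then write.
        Thread responder = new Thread(() -> {
            try {
                String response = responseQueue.take();
                responderMayWrite.await();
                conn.write(response);
                written.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        responder.start();

        try {
            // Plays the role of the reader thread: enqueue the response, then
            // close the connection because the auth failure threw.
            responseQueue.put("authentication is not enabled");
            if (closeBeforeWrite) {
                conn.close();                  // close wins the race
                responderMayWrite.countDown();
                written.await();
            } else {
                responderMayWrite.countDown(); // write wins the race
                written.await();
                conn.close();
            }
            responder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return conn.delivered();
    }

    public static void main(String[] args) {
        System.out.println(simulate(false)); // authentication is not enabled
        System.out.println(simulate(true));  // connection aborted
    }
}
```

In the real server nothing forces either ordering, which is why the outcome 
can differ from one platform to another.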

It's possible that the {{Thread.sleep}} you inserted in the client-side code 
changed the thread scheduling enough to mask the problem and make the test 
pass.  The client and server in this test run on the same machine, in the same 
process, so a delay on one side can easily shift timing on the other.

In order to validate my theory that it's a server-side race condition, I came 
up with an experiment that doesn't involve inserting sleep calls that might 
interfere with timing.  In {{Server.Reader#doRead}}, I commented out the 
{{closeConnection(c)}} call.  The test consistently passed when I did this, so 
I think that validates the theory that it's a server-side problem, and that one 
side of the race condition is the connection close.

This might indicate that we need to change the {{Server}} code to send the 
"authentication is not enabled" response synchronously, bypassing the 
{{Responder}} queue, or find some other way to chain the connection close 
after the response is handled normally from the queue.
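
As a rough illustration of the second option, the sketch below attaches a 
"close after write" flag to the queued response so that the responder thread, 
not the reader thread, performs the close.  This is only an assumption about 
what a fix could look like: {{QueuedResponse}}, {{closeAfterWrite}}, and 
{{FakeConnection}} are hypothetical names and are not part of the existing 
{{Server}} code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class CloseAfterResponseSketch {
    /** Hypothetical stand-in for a client connection; not Hadoop's Server.Connection. */
    static final class FakeConnection {
        private boolean closed;
        private String delivered; // what the client ends up observing

        synchronized void close() { closed = true; }

        synchronized void write(String response) {
            delivered = closed ? "connection aborted" : response;
        }

        synchronized String delivered() { return delivered; }

        synchronized boolean isClosed() { return closed; }
    }

    /** Hypothetical queue entry that carries its own close instruction. */
    static final class QueuedResponse {
        final FakeConnection conn;
        final String body;
        final boolean closeAfterWrite;

        QueuedResponse(FakeConnection conn, String body, boolean closeAfterWrite) {
            this.conn = conn;
            this.body = body;
            this.closeAfterWrite = closeAfterWrite;
        }
    }

    /** Runs one auth failure where the close is chained behind the write. */
    static FakeConnection simulate() {
        FakeConnection conn = new FakeConnection();
        BlockingQueue<QueuedResponse> responseQueue = new LinkedBlockingQueue<>();

        // The responder writes first and only then closes, on the same thread,
        // so the two steps can no longer be reordered against each other.
        Thread responder = new Thread(() -> {
            try {
                QueuedResponse r = responseQueue.take();
                r.conn.write(r.body);
                if (r.closeAfterWrite) {
                    r.conn.close();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        responder.start();

        try {
            // The reader only enqueues; it does not close the connection itself.
            responseQueue.put(
                new QueuedResponse(conn, "authentication is not enabled", true));
            responder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return conn;
    }

    public static void main(String[] args) {
        FakeConnection conn = simulate();
        System.out.println(conn.delivered() + ", closed=" + conn.isClosed());
    }
}
```

The trade-off is that the reader would have to stop closing the connection on 
this particular failure path, so any real change would need to make sure the 
connection is still cleaned up if the response is never written.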
                
> TestRPC fails on Windows
> ------------------------
>
>                 Key: HADOOP-8980
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8980
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: trunk-win
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>
> This failure may indicate a difference in socket handling on Windows.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
