[
https://issues.apache.org/jira/browse/HADOOP-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566415#action_12566415
]
Raghu Angadi commented on HADOOP-2789:
--------------------------------------
Could you attach the log files from failing test runs, both with and without
the patch?
Most likely not the cause of the test failure, but related: doPurge() removes
responses that have waited for too long, but it does not check whether they
are partially written. If we are not able to finish sending a partially
written response, I think we should close the connection (or update the
recvTime whenever any data is written).
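The doPurge() suggestion above can be sketched roughly as follows. This is a simplified, self-contained model and not the actual ipc.Server code: PurgeSketch, PendingResponse, recordWrite() and purge() are hypothetical names, and the queue stands in for a connection's response queue. It covers both options: a write refreshes the activity timestamp, and a purge that drops a partially written response tells the caller to close the connection.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

/** Hypothetical model of a purge pass; not the actual ipc.Server code. */
final class PurgeSketch {

    /** A queued response; 0 < bytesWritten < totalBytes means it is partly on the wire. */
    static final class PendingResponse {
        final int totalBytes;
        int bytesWritten;
        long lastActivityMillis; // refreshed whenever any data is written

        PendingResponse(int totalBytes, long nowMillis) {
            this.totalBytes = totalBytes;
            this.lastActivityMillis = nowMillis;
        }

        boolean isPartiallyWritten() {
            return bytesWritten > 0 && bytesWritten < totalBytes;
        }

        /** Records progress so that a slow but live client is not purged. */
        void recordWrite(int n, long nowMillis) {
            bytesWritten += n;
            lastActivityMillis = nowMillis;
        }
    }

    /**
     * Drops responses that have been idle for longer than maxIdleMillis.
     * Returns true if a partially written response was dropped, in which
     * case the caller should close the connection: the client has already
     * received a fragment and the stream can no longer be resynchronized.
     */
    static boolean purge(Deque<PendingResponse> queue, long nowMillis, long maxIdleMillis) {
        boolean mustClose = false;
        Iterator<PendingResponse> it = queue.iterator();
        while (it.hasNext()) {
            PendingResponse r = it.next();
            if (nowMillis - r.lastActivityMillis > maxIdleMillis) {
                if (r.isPartiallyWritten()) {
                    mustClose = true;
                }
                it.remove();
            }
        }
        return mustClose;
    }
}
```

Returning a flag rather than closing inside the loop keeps the sketch independent of any socket type; in the real server the close would presumably go through the existing connection teardown path.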
> Race condition in ipc.Server prevents response being written back to client.
> ----------------------------------------------------------------------------
>
> Key: HADOOP-2789
> URL: https://issues.apache.org/jira/browse/HADOOP-2789
> Project: Hadoop Core
> Issue Type: Bug
> Components: ipc
> Affects Versions: 0.16.0
> Reporter: Clint Morgan
> Priority: Critical
> Attachments: HADOOP-2789.patch
>
>
> I encountered a race condition in ipc.Server when writing the response
> back to the socket. Sometimes the write SelectionKey is being canceled
> when it should not be, and thus the full response never gets
> written. This results in clients timing out on the socket while waiting for
> the response.
> I am attaching a unit test that demonstrates the problem. It follows
> closely the TestIPC test, however the socket output buffer is set
> smaller than the result being sent back, so that partial writes
> occur. I also added random sleeps in the client to help provoke the race
> condition.
> On my machine this fails over half of the time.
> Looking at the code in ipc.Server.java, the problem manifests in
> Responder.doAsyncWrite(). If I comment out the key.cancel() line, then
> everything works fine.
> So we need to identify when to safely cancel the key.
> I tried the following:
> {noformat}
> private void doAsyncWrite(SelectionKey key) throws IOException {
>   Call call = (Call)key.attachment();
>   if (call == null) {
>     return;
>   }
>   if (key.channel() != call.connection.channel) {
>     throw new IOException("doAsyncWrite: bad channel");
>   }
>   if (processResponse(call.connection.responseQueue)) {
>     synchronized(call.connection.responseQueue) {
>       if (call.connection.responseQueue.size() == 0) {
>         LOG.info("Cancelling key for call " + call.toString() +
>                  " key: " + key.toString());
>         key.cancel(); // remove item from selector.
>       } else {
>         LOG.warn("NOT REALLY DONE: " + call.toString() +
>                  " key: " + key.toString());
>       }
>     }
>   }
> }
> {noformat}
> And this does catch some of the cases (e.g., the LOG.warn message gets hit),
> but I still hit the race condition.
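One possible way to close the remaining window, assuming the race is between the Responder deciding to cancel the key and a handler thread enqueueing another response, is to make "queue is empty, so drop write interest" and "enqueue, so request write interest" mutually exclusive under one lock. The sketch below is a simplified model and not the actual ipc.Server code: ResponseQueueSketch and its methods are hypothetical names, and a boolean stands in for the registered SelectionKey.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of one connection's response queue; not the actual ipc.Server code. */
final class ResponseQueueSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private boolean writeInterest; // stands in for a registered write SelectionKey

    /** Handler side: enqueue and request write interest under the same lock. */
    synchronized void enqueue(String response) {
        queue.addLast(response);
        writeInterest = true;
    }

    /**
     * Responder side: take one response and, still holding the lock, drop
     * write interest only if nothing else is queued. Because enqueue() uses
     * the same lock, a response can never be queued without write interest.
     * Returns the response taken, or null if the queue was empty.
     */
    synchronized String writeOne() {
        String r = queue.pollFirst();
        if (queue.isEmpty()) {
            writeInterest = false;
        }
        return r;
    }

    /** Invariant: write interest is set exactly when responses are pending. */
    synchronized boolean isConsistent() {
        return writeInterest == !queue.isEmpty();
    }

    synchronized int pending() {
        return queue.size();
    }
}
```

In the quoted attempt, processResponse() runs before the synchronized block, so the emptiness check and the write are not one atomic step; the model above folds both into a single critical section.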