[ https://issues.apache.org/jira/browse/HADOOP-9916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Lu updated HADOOP-9916: ---------------------------- Resolution: Fixed Fix Version/s: 2.1.1-beta Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed to trunk, branch-2, 2.1-beta. Thanks Binglin for the patch! > Race condition in ipc.Client causes TestIPC timeout > --------------------------------------------------- > > Key: HADOOP-9916 > URL: https://issues.apache.org/jira/browse/HADOOP-9916 > Project: Hadoop Common > Issue Type: Bug > Reporter: Binglin Chang > Assignee: Binglin Chang > Priority: Minor > Fix For: 2.1.1-beta > > Attachments: HADOOP-9916.v1.patch > > > TestIPC timeouts occasionally, for example: > [https://issues.apache.org/jira/browse/HDFS-5130?focusedCommentId=13749870&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13749870] > [https://issues.apache.org/jira/browse/HADOOP-9915?focusedCommentId=13753302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13753302] > Look into the code, there is race condition in oah.ipc.Client, the race > condition happen between RPC call thread and connection read response thread: > {code} > if (status == RpcStatusProto.SUCCESS) { > Writable value = ReflectionUtils.newInstance(valueClass, conf); > value.readFields(in); // read value > call.setRpcResponse(value); > calls.remove(callId); > {code} > Read Thread: > Connection.receiveRpcResponse-> call.setRpcResponse(value) -> notify Call > Thread > Call Thread: > Client.call -> Connection.addCall(retry with the same callId) -> notify read > thread > Read Thread: > calls.remove(callId) # intend to remove old call, but removes newly added > call... > Connection.waitForWork end up wait maxIdleTime and close the connection. The > call never get response and dead. > The problem doesn't show because previously callId is unique, we never > accidentally remove newly added calls, but when retry added this race > condition became possible. > To solve this, we can simply change order, remove the call first, then notify > call thread. > Note there are many places need this order change(normal case, error case, > cleanup case) > And there are some minor issues in TestIPC: > 1. there are two method with same name: > void testSerial() > void testSerial(int handlerCount, boolean handlerSleep, ...) > the second is not a test case(so should not use testXXX prefix), but > somehow it causes testSerial(first one) run two times, see test report: > {code} > <testcase time="26.896" classname="org.apache.hadoop.ipc.TestIPC" > name="testSerial"/> > <testcase time="25.426" classname="org.apache.hadoop.ipc.TestIPC" > name="testSerial"/> > {code} > 2. timeout annotation should be added, so next time related log is available. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira