I've experienced occasional cases where clients get stuck in the
following block of code in Mux.start. Has anyone experienced this
problem? I have a proposed solution below. Has anyone thought about a
similar solution already?
-- Current code --
1 asyncSendClientConnectionHeader();
2 synchronized (muxLock) {
3 while (!muxDown && !clientConnectionReady) {
4 try {
5 muxLock.wait(); // REMIND: timeout?
6 } catch (InterruptedException e) {
7 ...
8 }
9 }
10 if (muxDown) {
11 IOException ioe = new IOException(muxDownMessage);
12 ioe.initCause(muxDownCause);
13 throw ioe;
14 }
15 }
-- Explanation of the code --
This code handles the initial client-server handshake that starts a JERI
connection. In line 1, the client sends its 8-byte greeting to the
server. Then in the loop on lines 3-9, it waits for the server's
response. If the reader thread gets a satisfactory response from the
server, it sets clientConnectionReady=true and calls
muxLock.notifyAll(). In all other cases (aborted connection, mismatched
protocol version, etc) the reader invokes Mux.setDown() which sets
muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it throws if
the handshake was a failure.
In my scenario (which uses simple TCP sockets, nothing fancy), the
invoker thread sits on line 5 indefinitely. My problem hard to
reproduce, so I haven't found out what the server is doing in this case.
I hope to figure that out eventually, but presently I'm interested in
the "REMIND: timeout?" comment.
-- Timeout solution --
It seems obvious to me that there should be a timeout here. There are
lots of imaginable cases where the client could get stuck here:
server-side deadlock, abrupt server crash, logic error in client Mux
code. You'd expect that the server would either respond with its 8-byte
handshake very quickly or never, so a modest timeout (like 15 or 30
seconds) should be good. If that timeout is triggered, I would expect
that the code above would call Mux.setDown() and throw an IOException.
That exception would either cause a retry or be thrown up to the invoker
as a RemoteException.
-- Proposed code (untested) --
3 long now = System.currentTimeMillis();
4 long endTime = now + timeoutMillis;
5 while (!muxDown && !clientConnectionReady) {
6 if (now >= endTime) {
7 setDown("timeout waiting for server to respond
to handshake", null);
8 } else {
9 try {
10 muxLock.wait(endTime - now);
11 now = System.currentTimeMillis();
12 } catch (InterruptedException e) {
13 setDown("interrupt waiting for connection
header", e);
14 }
15 }
16 }
This code assumes a configurable timeoutMillis parameter has been set
earlier.
I can't think of any alternative solutions. Putting the timeout in the
Reader logic seems higher risk. There's incomplete code in JERI to
implement a ping packet (see Mux.asyncSendPing, never used), but that
would only be relevant after the initial handshake and wouldn't help
here.
Thanks,
Chris