Hello-
We have discovered an issue with Jini 2.1 that can cause a client to hang if a service VM enters a bad state. While this issue is somewhat similar to http://issues.apache.org/jira/browse/RIVER-254, the patch I would like to propose is a bit different and is still applicable to apache-river-2.1.1. I thought I would raise the issue here before filing a report in Jira...

One of many service VMs in a cluster entered a bad state such that the [EMAIL PROTECTED] method repeatedly threw this type of exception:

20080718-133309.486 ERROR [STDERR] Jul 18, 2008 1:33:09 PM com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop run
WARNING: select loop throws
java.lang.ArrayIndexOutOfBoundsException: -1
    at sun.nio.ch.AbstractPollSelectorImpl.implDereg(AbstractPollSelectorImpl.java:140)
    at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:121)
    at sun.nio.ch.PollSelectorImpl.doSelect(PollSelectorImpl.java:59)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:59)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:70)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:74)
    at com.sun.jini.jeri.internal.runtime.SelectionManager.waitForReadyKey(SelectionManager.java:364)
    at com.sun.jini.jeri.internal.runtime.SelectionManager.access$600(SelectionManager.java:80)
    at com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop.run(SelectionManager.java:287)
    at com.sun.jini.thread.ThreadPool$Worker.run(ThreadPool.java:136)
    at java.lang.Thread.run(Thread.java:534)

This caused Mux.start on the client side to hang indefinitely due to a call to Object.wait without a timeout:

"pool-1-thread-154" prio=1 tid=0x0852f768 nid=0x565c in Object.wait() [0x74809000..0x74809878]
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:429)
    at com.sun.jini.jeri.internal.mux.Mux.start(Mux.java:207)
    - locked <0x87ec4150> (a java.lang.Object)
    at net.jini.jeri.connection.ConnectionManager$OutboundMux.newRequest(ConnectionManager.java:356)
    - locked <0x87ec4158> (a net.jini.jeri.connection.ConnectionManager$OutboundMux)
    at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:630)
    - locked <0x87ec41c0> (a net.jini.jeri.connection.ConnectionManager$ReqIterator)
    at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:371)
    at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
    at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
    at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)

The Reaper thread later needed the ConnectionManager$OutboundMux lock held by the thread above:

"(JSK) [EMAIL PROTECTED] bc6b4].Reaper" daemon prio=1 tid=0x7e0231c8 nid=0x5017 waiting for monitor entry [0x72bda000..0x72bda878]
    at net.jini.jeri.connection.ConnectionManager$OutboundMux.checkIdle(ConnectionManager.java:377)
    - waiting to lock <0x87ec4158> (a net.jini.jeri.connection.ConnectionManager$OutboundMux)
    at net.jini.jeri.connection.ConnectionManager.checkIdle(ConnectionManager.java:256)
    - locked <0x86bdcfe8> (a net.jini.jeri.connection.ConnectionManager)
    at net.jini.jeri.connection.ConnectionManager$Reaper.run(ConnectionManager.java:571)
    - locked <0x86bdcfe8> (a net.jini.jeri.connection.ConnectionManager)
    at com.sun.jini.thread.ThreadPool$Worker.run(ThreadPool.java:136)
    at java.lang.Thread.run(Thread.java:534)

All other threads attempting to use the service then blocked trying to synchronize on the ConnectionManager held by the Reaper. We use a shared thread pool for all remote calls to all services, so with just one of many service VMs in this bad state the pool was eventually exhausted: all of its threads were waiting on the one ConnectionManager while the other services were still operable.

I would like to see a RemoteException thrown on the client side if Mux.start takes too long to establish a connection.
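The exhaustion mechanism itself is generic and easy to reproduce outside of River: in a fixed-size shared pool, one task that blocks indefinitely behind a stuck connection starves calls to every unrelated service. A minimal standalone sketch (the names and the single-thread pool are mine for illustration, not River's ThreadPool):

```java
import java.util.concurrent.*;

public class PoolExhaustionDemo {
    /** One worker, one stuck call: an unrelated call cannot run until the
     *  stuck one is released. Returns "<doneBeforeRelease>/<result>". */
    static String runScenario() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // shared pool
        CountDownLatch stuck = new CountDownLatch(1);

        // Call to the bad service: parks forever on the pool's only worker,
        // standing in for Mux.start's untimed Object.wait.
        pool.submit(() -> { stuck.await(); return null; });

        // Call to a perfectly healthy service: queued behind the stuck one.
        Future<String> healthy = pool.submit(() -> "ok");
        Thread.sleep(100);
        boolean doneBeforeRelease = healthy.isDone(); // false: pool exhausted

        stuck.countDown();             // release (e.g. a timeout finally fired)
        String result = healthy.get(); // now completes normally
        pool.shutdown();
        return doneBeforeRelease + "/" + result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runScenario()); // false/ok
    }
}
```

With many worker threads the outcome is the same, only slower: each new call to the bad service parks another worker until none are left for the healthy services.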
This would enable us to discard the bad service and retry with another. The patch I would propose looks like this, except with the hard-coded timeout value replaced by a configurable property:

synchronized (muxLock) {
    long timeToWaitUntil = System.currentTimeMillis() + 15000;
    while (!muxDown && !clientConnectionReady) {
        try {
            muxLock.wait(15000);
            if (System.currentTimeMillis() >= timeToWaitUntil) {
                setDown("client connection not ready within 15000 millis", null);
            }
        } catch (InterruptedException e) {
            setDown("interrupt waiting for connection header", e);
        }
    }
    if (muxDown) {
        IOException ioe = new IOException(muxDownMessage);
        ioe.initCause(muxDownCause);
        throw ioe;
    }
}

Is this a reasonable approach? I have verified that it has the desired effect by simulating the SelectionManager$SelectLoop errors on the server side (I'll leave that root cause analysis for a separate discussion).

-Matt
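The core of the patch is a deadline-based guard around the monitor wait. The same pattern, extracted into a self-contained sketch (the class and method names, and the simplified setDown handling, are mine, not River's):

```java
import java.io.IOException;

public class TimedMuxWait {
    private final Object muxLock = new Object();
    private boolean muxDown;
    private boolean clientConnectionReady;

    /** Waits for the connection header, failing with an IOException (which a
     *  caller could surface as a RemoteException) once the deadline passes. */
    void waitUntilReady(long timeoutMillis) throws IOException {
        synchronized (muxLock) {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (!muxDown && !clientConnectionReady) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    muxDown = true; // stands in for setDown(...)
                    throw new IOException(
                        "client connection not ready within " + timeoutMillis + " millis");
                }
                try {
                    muxLock.wait(remaining); // never sleep past the deadline
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("interrupt waiting for connection header", e);
                }
            }
            if (muxDown) {
                throw new IOException("mux is down");
            }
        }
    }

    /** Called when the server's connection header arrives. */
    void markReady() {
        synchronized (muxLock) {
            clientConnectionReady = true;
            muxLock.notifyAll();
        }
    }
}
```

One detail worth noting: waiting for the *remaining* time rather than a fixed 15000 ms per iteration means a spurious wakeup or an unrelated notify cannot stretch the total wait beyond the configured timeout.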