Hello-
We have discovered an issue with Jini 2.1 that can cause a client to hang if a service VM enters a bad state. While this issue is somewhat similar to http://issues.apache.org/jira/browse/RIVER-254, the patch I would like to propose is a bit different and is still applicable to apache-river-2.1.1. I thought I would raise the issue here before filing a report in Jira...

One of many service VMs in a cluster entered a bad state such that the [EMAIL PROTECTED] method repeatedly threw this type of exception:

20080718-133309.486 ERROR [STDERR] Jul 18, 2008 1:33:09 PM com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop run
WARNING: select loop throws
java.lang.ArrayIndexOutOfBoundsException: -1
    at sun.nio.ch.AbstractPollSelectorImpl.implDereg(AbstractPollSelectorImpl.java:140)
    at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:121)
    at sun.nio.ch.PollSelectorImpl.doSelect(PollSelectorImpl.java:59)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:59)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:70)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:74)
    at com.sun.jini.jeri.internal.runtime.SelectionManager.waitForReadyKey(SelectionManager.java:364)
    at com.sun.jini.jeri.internal.runtime.SelectionManager.access$600(SelectionManager.java:80)
    at com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop.run(SelectionManager.java:287)
    at com.sun.jini.thread.ThreadPool$Worker.run(ThreadPool.java:136)
    at java.lang.Thread.run(Thread.java:534)

This caused Mux.start on the client side to hang indefinitely due to a call to Object.wait without a timeout:

"pool-1-thread-154" prio=1 tid=0x0852f768 nid=0x565c in Object.wait() [0x74809000..0x74809878]
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:429)
    at com.sun.jini.jeri.internal.mux.Mux.start(Mux.java:207)
    - locked <0x87ec4150> (a java.lang.Object)
    at net.jini.jeri.connection.ConnectionManager$OutboundMux.newRequest(ConnectionManager.java:356)
    - locked <0x87ec4158> (a net.jini.jeri.connection.ConnectionManager$OutboundMux)
    at net.jini.jeri.connection.ConnectionManager$ReqIterator.next(ConnectionManager.java:630)
    - locked <0x87ec41c0> (a net.jini.jeri.connection.ConnectionManager$ReqIterator)
    at net.jini.jeri.BasicObjectEndpoint$1.next(BasicObjectEndpoint.java:371)
    at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethodOnce(BasicInvocationHandler.java:708)
    at net.jini.jeri.BasicInvocationHandler.invokeRemoteMethod(BasicInvocationHandler.java:659)
    at net.jini.jeri.BasicInvocationHandler.invoke(BasicInvocationHandler.java:528)

The Reaper thread later needed the ConnectionManager$OutboundMux lock held by the thread above:

"(JSK) [EMAIL PROTECTED] bc6b4].Reaper" daemon prio=1 tid=0x7e0231c8 nid=0x5017 waiting for monitor entry [0x72bda000..0x72bda878]
    at net.jini.jeri.connection.ConnectionManager$OutboundMux.checkIdle(ConnectionManager.java:377)
    - waiting to lock <0x87ec4158> (a net.jini.jeri.connection.ConnectionManager$OutboundMux)
    at net.jini.jeri.connection.ConnectionManager.checkIdle(ConnectionManager.java:256)
    - locked <0x86bdcfe8> (a net.jini.jeri.connection.ConnectionManager)
    at net.jini.jeri.connection.ConnectionManager$Reaper.run(ConnectionManager.java:571)
    - locked <0x86bdcfe8> (a net.jini.jeri.connection.ConnectionManager)
    at com.sun.jini.thread.ThreadPool$Worker.run(ThreadPool.java:136)
    at java.lang.Thread.run(Thread.java:534)

All other threads attempting to use the service then blocked trying to synchronize on the ConnectionManager held by the Reaper. We use a shared thread pool for all remote calls to all services, so with just one of many service VMs in this bad state the pool was eventually exhausted: all of its threads were waiting on the one ConnectionManager while the other services were still operable.

I would like to see a RemoteException thrown on the client side if Mux.start takes too long to establish a connection.
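The exhaustion mechanism itself is generic and easy to reproduce outside of River: in a fixed-size shared pool, one task that blocks indefinitely behind a stuck connection starves calls to every unrelated service. A minimal standalone sketch (the names and the single-thread pool are mine for illustration, not River's ThreadPool):

```java
import java.util.concurrent.*;

public class PoolExhaustionDemo {
    /** One worker, one stuck call: an unrelated call cannot run until the
     *  stuck one is released. Returns "<doneBeforeRelease>/<result>". */
    static String runScenario() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // shared pool
        CountDownLatch stuck = new CountDownLatch(1);

        // Call to the bad service: parks forever on the pool's only worker,
        // standing in for Mux.start's untimed Object.wait.
        pool.submit(() -> { stuck.await(); return null; });

        // Call to a perfectly healthy service: queued behind the stuck one.
        Future<String> healthy = pool.submit(() -> "ok");
        Thread.sleep(100);
        boolean doneBeforeRelease = healthy.isDone(); // false: pool exhausted

        stuck.countDown();             // release (e.g. a timeout finally fired)
        String result = healthy.get(); // now completes normally
        pool.shutdown();
        return doneBeforeRelease + "/" + result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runScenario()); // false/ok
    }
}
```

With many worker threads the outcome is the same, only slower: each new call to the bad service parks another worker until none are left for the healthy services.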
This would enable us to discard the bad service and retry with another. The patch I would propose looks like this, except with the hard-coded timeout value replaced by a configurable property:

synchronized (muxLock) {
    long timeToWaitUntil = System.currentTimeMillis() + 15000;
    while (!muxDown && !clientConnectionReady) {
        try {
            muxLock.wait(15000);
            if (System.currentTimeMillis() >= timeToWaitUntil) {
                setDown("client connection not ready within 15000 millis", null);
            }
        } catch (InterruptedException e) {
            setDown("interrupt waiting for connection header", e);
        }
    }
    if (muxDown) {
        IOException ioe = new IOException(muxDownMessage);
        ioe.initCause(muxDownCause);
        throw ioe;
    }
}

Is this a reasonable approach? I have verified that it has the desired effect by simulating the SelectionManager$SelectLoop errors on the server side (I'll leave that root cause analysis for a separate discussion).

-Matt
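The core of the patch is a deadline-based guard around the monitor wait. The same pattern, extracted into a self-contained sketch (the class and method names, and the simplified setDown handling, are mine, not River's):

```java
import java.io.IOException;

public class TimedMuxWait {
    private final Object muxLock = new Object();
    private boolean muxDown;
    private boolean clientConnectionReady;

    /** Waits for the connection header, failing with an IOException (which a
     *  caller could surface as a RemoteException) once the deadline passes. */
    void waitUntilReady(long timeoutMillis) throws IOException {
        synchronized (muxLock) {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (!muxDown && !clientConnectionReady) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    muxDown = true; // stands in for setDown(...)
                    throw new IOException(
                        "client connection not ready within " + timeoutMillis + " millis");
                }
                try {
                    muxLock.wait(remaining); // never sleep past the deadline
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("interrupt waiting for connection header", e);
                }
            }
            if (muxDown) {
                throw new IOException("mux is down");
            }
        }
    }

    /** Called when the server's connection header arrives. */
    void markReady() {
        synchronized (muxLock) {
            clientConnectionReady = true;
            muxLock.notifyAll();
        }
    }
}
```

One detail worth noting: waiting for the *remaining* time rather than a fixed 15000 ms per iteration means a spurious wakeup or an unrelated notify cannot stretch the total wait beyond the configured timeout.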