turbine-jcs-user  

RE: Master cache machine no longer reachable causes spurious threads?

Matthew Cooke
Thu, 06 Jan 2005 04:07:59 -0800

A timeout can be set on the RMI socket when it is constructed, using
something like this: 

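import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

final int timeoutMillis = 30000; // example value, in milliseconds

// Install once, before any RMI connection is opened; setSocketFactory
// throws IOException if a factory has already been set.
RMISocketFactory.setSocketFactory(new RMISocketFactory() {
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        socket.setSoTimeout(timeoutMillis); // bound every blocking read
        socket.setSoLinger(false, 0);
        return socket;
    }

    public ServerSocket createServerSocket(int port) throws IOException {
        return new ServerSocket(port);
    }
});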

I haven't tested this solution, but I had a peek at the code in JCS and
it didn't look too hard. Async blocking threads with a timeout might be
a good idea, but since timing out the RMI thread at the socket level looks
simpler, it might be worth putting that in, in the interim.
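The snippet above could be packaged as a reusable class along these lines. This is an untested sketch, not JCS code; the class name TimeoutRMISocketFactory and its install() helper are hypothetical. It differs from the snippet in one respect: the plain Socket(host, port) constructor has no connect timeout, so connecting to a host that has vanished can still block for the operating system's default, and the sketch therefore also uses Socket.connect(SocketAddress, int) (available since JDK 1.4) to bound the connect. Because RMISocketFactory.setSocketFactory sets a JVM-wide default, an application might be able to call install() once at startup, before JCS opens its first RMI connection, without patching JCS, assuming JCS does not install its own factory and its remote objects use the default one.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

/**
 * Hypothetical factory (not part of JCS) that bounds both the TCP connect
 * and each blocking read, so an unreachable host shows up as an
 * IOException instead of a long hang.
 */
public class TimeoutRMISocketFactory extends RMISocketFactory {
    private final int timeoutMillis;

    public TimeoutRMISocketFactory(int timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket(); // unconnected, so connect() can take a timeout
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis); // bound every blocking read
            return socket;
        } catch (IOException e) {
            socket.close(); // don't leak the half-open socket
            throw e;
        }
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        return new ServerSocket(port);
    }

    /**
     * Installs the factory JVM-wide. Call once at startup, before the first
     * RMI connection. Returns false if a factory was already installed.
     */
    public static boolean install(int timeoutMillis) {
        try {
            RMISocketFactory.setSocketFactory(new TimeoutRMISocketFactory(timeoutMillis));
            return true;
        } catch (IOException alreadySet) {
            return false;
        }
    }
}
```

Keep in mind that the read timeout applies to every remote call made through these sockets, so it should be comfortably longer than the slowest legitimate get.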

Matt.

On Wed, 2005-01-05 at 13:12 -0800, Smuts, Aaron wrote:
> I could use Doug Lea's FutureResult and call timedGet(millis).  
> 
> http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/FutureResult.html
> 
> This would add some overhead, but it would be safer.  
> 
> If a get times out, I can assume that the server is down and throw an error.  
> Then the client will go zombie and start balking, just as it does when we 
> shut down the server but not the machine.
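Doug Lea's util.concurrent package later evolved into java.util.concurrent (Java 5), where FutureResult.timedGet(millis) roughly corresponds to Future.get(timeout, unit). The sketch below swaps in java.util.concurrent and Java 8 lambdas rather than the EDU.oswego classes. The class and exception names are hypothetical, and the zombie flag is only a stand-in for JCS's own zombie behaviour, not its implementation: the first get that times out throws, and every later get returns null at once.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical client-side wrapper, not JCS code. */
public class ZombieOnTimeoutGetter {

    /** Thrown by the first get that exceeds the timeout. */
    public static class RemoteTimeoutException extends RuntimeException {
        public RemoteTimeoutException(String message) {
            super(message);
        }
    }

    // Daemon workers, so a call stuck on a dead socket cannot keep the JVM alive
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "remote-get");
        t.setDaemon(true);
        return t;
    });
    private final long timeoutMillis;
    private volatile boolean zombie = false;

    public ZombieOnTimeoutGetter(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public boolean isZombie() {
        return zombie;
    }

    /** Runs remoteGet on a worker thread and waits at most timeoutMillis for it. */
    public <V> V get(Callable<V> remoteGet) {
        if (zombie) {
            return null; // balk: behave as a miss without touching the server
        }
        Future<V> result = pool.submit(remoteGet);
        try {
            // Plays the role of FutureResult.timedGet(millis)
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            zombie = true;       // assume the server is down
            result.cancel(true); // interrupts the worker; may not unblock socket I/O
            throw new RemoteTimeoutException("remote get exceeded " + timeoutMillis + " ms");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause()); // the remote call itself failed
        }
    }
}
```

Note that cancel(true) only sets the worker's interrupt flag; a thread blocked in a classic socket read generally stays blocked until the socket itself times out, so this complements rather than replaces a socket-level timeout.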
> 
> I could start by putting it in just the RMI client, since we don't have the 
> same problem elsewhere.  I'll try something.
> 
> I think we need a general threadpool configuration mechanism. . . .
> 
> Aaron 
> 
> -----Original Message-----
> From: Hanson Char [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, January 05, 2005 12:57 PM
> To: Turbine JCS Users List
> Subject: Re: Master cache machine no longer reachable causes spurious threads?
> 
> Just an idea: turn each get operation into an async operation (using a thread 
> from a thread pool) with an optional timeout parameter (with a default of, 
> say, 5 seconds).
> 
> If the get doesn't finish within the timeout period, just terminate the 
> thread and return null.
> 
> So RMI or not, it's guaranteed not to block under all circumstances. 
> Probably something from Doug's concurrent (backport) library can be taken 
> advantage of.
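One caveat about terminating the thread: Java has no safe way to kill a thread (Thread.stop is deprecated), and interrupting a thread blocked in socket I/O usually has no effect. A sketch that stays close to the idea is to abandon the call after the timeout and return null, while capping how many calls may be outstanding, so stuck workers cannot pile up the way the original report describes. Assumptions: java.util.concurrent (Java 5+) with Java 8 lambdas; the class name, the Semaphore cap, and treating a failed call as a miss are illustrative choices, not anything JCS provides.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical async-get helper, not part of JCS. */
public class BoundedTimedGet {
    /** Default suggested in the thread: 5 seconds. */
    public static final long DEFAULT_TIMEOUT_MILLIS = 5000;

    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "async-get");
        t.setDaemon(true);
        return t;
    });
    // Caps how many gets may be in flight (including stuck ones) at once
    private final Semaphore inFlight;

    public BoundedTimedGet(int maxInFlight) {
        this.inFlight = new Semaphore(maxInFlight);
    }

    public <V> V get(Callable<V> remoteGet) {
        return get(remoteGet, DEFAULT_TIMEOUT_MILLIS);
    }

    /** Returns the value, or null on timeout, failure, or when too many calls are stuck. */
    public <V> V get(Callable<V> remoteGet, long timeoutMillis) {
        if (!inFlight.tryAcquire()) {
            return null; // every slot is held by a call that hasn't finished
        }
        Future<V> result = pool.submit(() -> {
            try {
                return remoteGet.call();
            } finally {
                inFlight.release(); // slot freed only when the call really ends
            }
        });
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null; // abandon the call; its thread keeps its slot until it returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            return null; // treat a failed remote call as a cache miss
        }
    }
}
```

With maxInFlight set to 1, a get against a dead server times out once, and every later get fails fast until the stuck call finally returns.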
> 
> H
> 
> 
> On Wed, 5 Jan 2005 09:52:38 -0800, Smuts, Aaron <[EMAIL PROTECTED]> wrote:
> > If you know of a solution, please send it to me.
> > 
> > Thanks,
> > 
> > Aaron
> > 
> > -----Original Message-----
> > From: Matthew Cooke [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, January 05, 2005 2:12 AM
> > To: Turbine JCS Users List
> > Subject: Re: Master cache machine no longer reachable causes spurious 
> > threads?
> > 
> > Master remote cache: Sun JDK 1.4 on Red Hat Linux 7.3.
> > Client machines: Sun JDK 1.4 on Linux (production) and Windows XP (testing).
> > 
> > The problem was reproduced by executing "shutdown -h now" on the 
> > master cache machine without first cleanly stopping the master remote cache 
> > running on it. Client machines then hang on gets for much longer than 
> > 30 seconds before throwing a NoRouteToHostException.
> > 
> > Currently we have no fix other than: don't kill the master cache machine 
> > suddenly, and if the hardware dies, panic. Someone was investigating 
> > changing the RMI settings, but without success. I know it is possible by 
> > modifying the JCS/RMI code, as I see many other RMI users have had similar 
> > issues (Google) and a fix is documented; I can probably dig it up if useful.
> > 
> > Matt.
> > 
> > Smuts, Aaron wrote:
> > > I can't reproduce the issue.  I can get 30-second pauses if I pull the 
> > > network cable out, but not 15-minute locks.  I'm running the remote 
> > > server on a Windows box and hitting it from a Linux box.  I can disrupt 
> > > things sometimes if I pull the network cable out of the Windows box 
> > > running the server.  If I just kill the server, everything is fine.
> > > I'm running JDK 1.4.2_04.
> > >
> > > What JDK and OS are you using?
> > >
> > > Aaron
> > >
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Smuts, Aaron [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, January 04, 2005 1:53 PM
> > > To: Turbine JCS Users List; [EMAIL PROTECTED]
> > > Subject: RE: Master cache machine no longer reachable causes spurious 
> > > threads?
> > >
> > > The various RMI properties that can be set are listed here.
> > >
> > > http://java.sun.com/j2se/1.4.2/docs/guide/rmi/sunrmiproperties.html
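For example, the page above lists sun.rmi.transport.tcp.responseTimeout, a Sun-specific (non-standard) property that sets a client-side read timeout, in milliseconds, on responses to remote calls; 0, the default, means wait forever. A minimal sketch, assuming the property is honoured by the JDK in use and is read when the RMI runtime initializes, so it should be set before JCS makes its first remote call, or passed on the command line as -Dsun.rmi.transport.tcp.responseTimeout=30000. The class name and the 30000 value are illustrative.

```java
/** Hypothetical startup helper; the property itself is Sun-specific. */
public class RmiResponseTimeout {
    static final String RESPONSE_TIMEOUT = "sun.rmi.transport.tcp.responseTimeout";

    /** Call before any RMI/JCS activity. Milliseconds; 0 means no timeout. */
    public static void configure(long timeoutMillis) {
        System.setProperty(RESPONSE_TIMEOUT, Long.toString(timeoutMillis));
    }

    public static void main(String[] args) {
        configure(30000); // example value (an assumption), in milliseconds
        System.out.println(RESPONSE_TIMEOUT + "=" + System.getProperty(RESPONSE_TIMEOUT));
    }
}
```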
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Smuts, Aaron [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, January 04, 2005 1:40 PM
> > > To: [EMAIL PROTECTED]
> > > Cc: turbine-jcs-user@jakarta.apache.org
> > > Subject: RE: Master cache machine no longer reachable causes spurious 
> > > threads?
> > >
> > > Remove and put requests to the remote RMI server are done 
> > > asynchronously; however, gets are synchronous.
> > >
> > > If a get locks up, then it could potentially block other put and remove 
> > > requests locally.  Are you seeing all requests block?
> > >
> > > Why is the situation different if the machine goes down, versus the RMI 
> > > server not running?  I haven't dug into the Sun RMI code very far.
> > >
> > > What do you suggest?
> > >
> > > You could run in put-only mode with remove-on-put set to false if you 
> > > frequently have machines shutting down, thereby killing the remote server.
> > >
> > > Aaron
> > >
> > > -----Original Message-----
> > > From: Tim Cocks [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, December 07, 2004 9:53 AM
> > > To: Smuts, Aaron
> > > Cc: turbine-jcs-user@jakarta.apache.org
> > > Subject: Re: Master cache machine no longer reachable causes spurious 
> > > threads?
> > >
> > > Thanks for your time.  We are using the remote server.  We have found it 
> > > is almost exactly 15 minutes between when the machine running the master 
> > > cache shuts down and when the clients realise the remote cache is no 
> > > longer accessible.  During those 15 minutes, calls to JCS block.
> > > After the 15 minutes, the calls return.
> > >
> > > The problem appears to be an RMI one. The fact that the delay is consistently 
> > > ~15 minutes seems to imply the timeout is working correctly, but is set 
> > > too high.  We considered changing the RMI timeouts by overriding 
> > > RMISocketFactory. Unfortunately this would require us to change the JCS 
> > > source code, something we would like to avoid.
> > >
> > > Tim
> > >
> > > On Mon, 6 Dec 2004 13:45:59 -0800, Smuts, Aaron <[EMAIL PROTECTED]> wrote:
> > >
> > >>I'll need to look into this.
> > >>
> > >>You are using the remote server?  The client reconnect must not be timing 
> > >>out properly.
> > >>
> > >>Aaron
> > >>
> > >>
> > >>
> > >>
> > >>-----Original Message-----
> > >>From: Tim Cocks [mailto:[EMAIL PROTECTED]
> > >>Sent: Friday, December 03, 2004 2:39 AM
> > >>To: turbine-jcs-user@jakarta.apache.org
> > >>Subject: Master cache machine no longer reachable causes spurious threads?
> > >>
> > >>We use JCS outside of Turbine on about 20 machines connected to a JCS 
> > >>master cache.
> > >>
> > >>On occasion we have had to kill the JCS master cache process and have 
> > >>observed the client machines gracefully realise the master cache is no 
> > >>longer available.  They continue to work indefinitely, albeit without 
> > >>access to the master cache.
> > >>
> > >>However, when the machine running the master cache goes down completely 
> > >>the clients continue attempting to connect.  In the process, they are 
> > >>creating more and more blocking threads and the JVM eventually terminates.
> > >>
> > >>Is this a known problem?  If so, are there any solutions?
> > >>
> > >>Thanks in advance for any help,
> > >>
> > >>Tim Cocks
> > >>
> > >>---------------------------------------------------------------------
> > >>To unsubscribe, e-mail:
> > >>[EMAIL PROTECTED]
> > >>For additional commands, e-mail:
> > >>[EMAIL PROTECTED]
> > >>
> > >>
> > >
> > >
> > 
> >
> 
-- 
Matthew Cooke <[EMAIL PROTECTED]>

