Re: [Prime] Primenet server reliability issues

Brian Beesley Mon, 02 Jan 2006 10:53:09 -0800

On Monday 02 January 2006 16:48, Rhesa Rozendaal wrote:
> Michael Vang wrote:
> >
> > Set checkins to 7 days. Keep a large queue of work.
>
> I will give this a try. I'm currently checking in every 2 days, precisely
> to be able to see if any processes hang.
> I wonder if it will help though, since most of the time it's the instances
> doing factorization that hang. I don't really know if they do when
> reporting results, or just updating their status. Typically, they have
> results every 5 days or so.


That matches my experience almost exactly. I've had to unhang a Linux client
twice since Xmas. Incidentally, plain kill doesn't work but kill -9 does. I
haven't seen any resource drain (memory leaks etc.) on my systems afterwards,
though kill -9 is documented as being somewhat dangerous for that reason.

> Setting the checkin interval to 7 days may mean that some clients might be
> idle for well over a week before I find out about it.

Well, there are ways around that. On a Linux box, for example, you can
monitor CPU usage (the client's usage drops to zero when it hangs), then mail
the administrator and/or force the client to restart if it's obviously
jammed. An unexpectedly old p??????? file is also a good indication of a
misbehaving client.
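
As a rough illustration of the old-file check, here is a minimal Python
sketch. Only the p??????? pattern comes from the description above; the
function name, the work directory argument and the age threshold are
illustrative assumptions, and a sensible threshold depends on how often your
client writes its save files.

```python
import glob
import os
import time


def stale_save_files(work_dir, max_age_seconds, now=None):
    """Return save files in work_dir not modified within max_age_seconds.

    work_dir and max_age_seconds are assumptions for this sketch; pick
    values that match your own client setup.
    """
    if now is None:
        now = time.time()
    # "p???????" is the save-file pattern mentioned in the message above.
    pattern = os.path.join(glob.escape(work_dir), "p???????")
    stale = []
    for path in sorted(glob.glob(pattern)):
        age = now - os.path.getmtime(path)
        if age > max_age_seconds:
            stale.append(path)
    return stale
```

Run from cron, a non-empty result could trigger a mail to the administrator
or a restart of the client. What to do with the result is left out here on
purpose, since it depends on the local setup.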
>
> > to become a little less reliant on the server. Fortunately, Prime95
> > and mprime are designed so this is possible.
>
> Except that the client on linux doesn't cleanly deal with server outages,
> so until that is resolved, I don't expect my frustration to end ;)

The windoze client hangs too....
>
> > Please have patience with the server. GW and SK are very aware of the
> > situation and have done everything possible to make it work as
> > smoothly as possible. I know it appears as if the server is a
> > neglected box in a dark closet somewhere, but rest assured a lot of
> > work and effort is being applied to it daily.
>
> I am patient, no worries. I've been a part of GIMPS since 2000, and don't
> feel like giving up at all. I believe you when you say that you guys spend
> a lot of effort on the server, but I can't help wondering why it needs so
> much attention in the first place. None of my web apps need much
> intervention at all (and I have several high traffic ones under my care),
> and I don't even consider myself that much of a wizard sysadmin. Do you
> have any idea why your server is falling off the net so often? And why has
> it gotten so much worse in the past few months, compared to a couple of
> years back?
>
> The main reason I'm speaking out at this point, is that these issues have
> been going on for quite a while now, without any sign of improvement.

I agree. Absolutely.

> I 
> would like to see some information on what's wrong, and what you guys are
> doing to fix it. The primenet server appears to be a neglected box stuffed
> away somewhere out of sight because I never saw any news about the ongoing
> work. Some PR in that respect would have kept me silent :)
>

Yes.

Also the manual testing pages have been broken (as in _totally_ unusable) for 
many months. This is a pain to those of us who like to go around tidying up 
loose ends.

There are also a number of things which IMO should be added to the server
code, e.g. the server should avoid issuing a double-check assignment to the
same user ID and/or computer ID that ran the first test.
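
To make the double-check rule concrete, here is a minimal Python sketch. It
is not based on the actual server code: the FirstTest record, its field
names and pick_double_check are all hypothetical, and it assumes the
strictest reading, where a match on either the user ID or the computer ID
rules a candidate out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstTest:
    """Hypothetical record of who ran the first test of an exponent."""
    exponent: int
    user_id: str
    computer_id: str


def pick_double_check(candidates, user_id, computer_id):
    """Return the first candidate the requester did not first-test.

    A candidate is skipped if the first test came from the same user ID
    or the same computer ID (an assumption of this sketch). Returns None
    when nothing suitable is available.
    """
    for test in candidates:
        if test.user_id == user_id or test.computer_id == computer_id:
            continue
        return test
    return None
```

Skipping on either ID is the conservative choice; a server that only wanted
to rule out the same machine would compare computer_id alone.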

Perhaps what we really need to do at this stage is to publish a formal spec
for the server and its interaction with the client. Once that exists, someone
just might invest the time to rewrite the server code in a distributed,
hardware/OS-independent way, removing the project's reliance on a single box
and, effectively, a single sysadmin.

GIMPS/PrimeNet has made amazing discoveries and is still fun to work with - 
but IMO the current position wrt the server renders its long-term 
sustainability somewhat doubtful.

Regards
Brian Beesley
_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime
