Re: [Prime] Primenet server reliability issues

Brian Beesley Wed, 04 Jan 2006 05:56:13 -0800

On Tuesday 03 January 2006 02:22, George Woltman wrote:
>
> Restarting the server when it goes dead is a minimum amount of effort.


Maybe, but it's more effort than is desirable. I also get the impression that 
monitoring the server to check whether or not it's working takes effort - 
maybe not much - which needs to be applied continuously, but isn't.
>
> My best guess is it seems to die for three reasons.
>
> 1)  Bad data sent by client that isn't rigorously checked by server.
> Buffer overruns on user names or passwords or some other field are
> probably getting the primenet server application in a funny hung state.

In my experience _everything_ connected to the net gets bombed by badly 
formatted service requests. Even if the genuine client is impeccable, there 
are still the requests crafted by crackers looking for vulnerable systems. 
The situation is not likely to improve in the short to medium term, if ever. 
It's _essential_ to keep software up to date so that known vulnerabilities 
are not exposed.
>
> 2)  The backend database goes down.  An upgrade to SQLServer 2005 might
> help. If the primenet server application handled errors better, that would
> help too.  This is the infamous ERROR 3 problem.

Sure. But I can't help thinking that a crashing database is a sign of enemy 
action (hostile input), of software that needs to be patched or upgraded, or 
of both. Also a little investment in automatic service monitoring might at 
least enable a failed database service to be resurrected in reasonably short 
order without manual intervention.
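
As a rough illustration of that kind of automatic monitoring, here is a 
minimal watchdog sketch in Python. Everything specific in it is an 
assumption made for the example: the function names, the polling interval 
and the restart command are hypothetical, and a bare TCP connect only shows 
that something is listening on the port, not that the application behind it 
is healthy.

```python
import socket
import subprocess
import time


def service_is_up(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves the port is accepting connections; a hung application
    that still accepts connections would need a real protocol-level probe.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port, restart_cmd, interval=60.0, max_checks=None,
          probe=service_is_up, run=subprocess.run, sleep=time.sleep):
    """Poll the service and run restart_cmd each time the probe fails.

    restart_cmd is a hypothetical command list supplied by the operator.
    probe, run and sleep are injectable so the loop can be exercised
    without a real service. Returns the number of restarts attempted.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe(host, port):
            run(restart_cmd, check=False)  # do not die if the restart fails
            restarts += 1
        checks += 1
        sleep(interval)
    return restarts
```

Run with max_checks=None this loops forever, so in practice it would itself 
be started by whatever the operating system provides for keeping services 
alive.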

Enemy action can be countered by extra validation, or by changing the 
client/server protocol to use some reasonably robust authentication mechanism 
- e.g. message serial numbering & timestamping combined with a 
cryptographically generated signature. I don't see why individual clients 
shouldn't have to support this extra effort; everybody has a web browser 
which does this sort of thing!
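
To make the serial number plus timestamp plus signature idea concrete, here 
is a small Python sketch. It is not taken from any PrimeNet specification: 
the message layout, the names sign_message and Verifier, and the 300 second 
clock tolerance are all assumptions for illustration. It also uses an HMAC 
with a shared secret, which is a message authentication code rather than a 
public-key signature; a real design might well prefer the latter.

```python
import hashlib
import hmac
import time


def sign_message(key, serial, timestamp, payload):
    """Return a hex HMAC-SHA256 tag over serial, timestamp and payload.

    The "serial|timestamp|payload" layout is a hypothetical wire format.
    """
    msg = f"{serial}|{timestamp}|".encode() + payload
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


class Verifier:
    """Server-side check for one client: authentic, fresh, not replayed."""

    def __init__(self, key, max_skew=300):
        self.key = key
        self.max_skew = max_skew   # tolerated clock difference in seconds
        self.last_serial = 0       # highest serial number accepted so far

    def accept(self, serial, timestamp, payload, signature, now=None):
        now = int(time.time()) if now is None else now
        expected = sign_message(self.key, serial, timestamp, payload)
        # Constant-time comparison avoids leaking how many characters matched.
        if not hmac.compare_digest(expected, signature):
            return False           # forged or corrupted message
        if abs(now - timestamp) > self.max_skew:
            return False           # too old, or clock badly wrong
        if serial <= self.last_serial:
            return False           # replay of an earlier message
        self.last_serial = serial
        return True
```

The point of checking the tag first is that malformed or hostile input is 
rejected before any of it reaches the parsing and database code.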
>
> 3)  The manual web pages provide an opportunity to flood the server
> with arbitrary text results.  Buffer overruns or mis-parsing this text
> might have led to some outages.

I thought we already had a pretty strict volume limiter - in the days before 
the manual assignments page stopped working altogether?
>
> 4)  This isn't really an outage, but at the top of each hour building
> the hourly reports takes about ten minutes.  When the stats reports were
> first brought online, the server was managing far, far fewer exponents.

Hum. I used to run a database transaction logging service where the 
transactions sometimes ran into hundreds of megabytes per hour. I think this 
is a couple of decimal orders of magnitude heavier than PrimeNet? I ran 
reports only daily, but there were a lot more than PrimeNet generates (at 
least the ones published on the web pages). The report generation did take a 
few hours per day but I only had a 1 GHz PIII and a slowish disk subsystem to 
run it on.

I don't know what you are using but my main tools were bzip2 (without which 
retained data would have cost far too much disk space, but which does consume 
considerable CPU resources), bash, grep, sort and awk.

Just before I left that job I had to move this from the Linux "PC" (actually 
it was a rack mounted server) onto a Sun UltraSPARC system. The effort was 
minimal - not far from zero - because of the simple and open structure.

Finally it occurs to me that one clear way of solving this problem would be to 
run the report generation on a separate system - a basic server with 
read-only access to the database, to which the server permits local access 
only, would seem to be a pretty cheap way of getting the CPU cycles needed to 
generate the reports without bogging the server down. If we really need to 
supply extra hardware then I'm sure US$1,000 would cover it.
>
> For the curious, a spec for the next  client-server interface is at
> http://v5.mersenne.org/v5design/v5webAPI_0.96.html

Thanks.
>
> >At that stage it just
> >might be possible that someone would invest time in rewriting the server
> > code so that it can be implemented in a distributed, hardware/OS
> > independent method so that reliance on a single box and effectively a
> > single sysadmin can be removed from this project.
>
> All efforts have been directed toward replacing the primenet server
> application with a new from scratch more robust and bulletproof
> application.  Scott and I have been working on it for about 4 months,
> but at nowhere near the 40 hours a week required to make this happen
> quickly.

Need any help? I could probably chip in one day a week...

It seems to be a necessity to do something with the client to prevent it from 
hanging. I'd suggest removing the PrimeNet comms from the client altogether! 
Have a separate program which would either be forked by the client when 
needed and terminate itself when done, or run as a separate background 
process to handle the comms. That way the client could run uninterrupted 
(unless it happens to run out of work altogether).
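
The "forked when needed" variant might look something like the following 
Python sketch. The helper program, the function names and the timeout are 
hypothetical stand-ins, and the real client is of course not written in 
Python; the sketch only shows the control flow, in which the worker starts 
the comms helper, carries on computing, and kills the helper if it hangs.

```python
import subprocess
import sys

# Hypothetical stand-in for a real comms helper program: it does nothing
# and exits successfully.
DEFAULT_HELPER = [sys.executable, "-c", "pass"]


def spawn_comms(argv=None):
    """Start the comms helper and return at once; the caller keeps working."""
    return subprocess.Popen(argv or DEFAULT_HELPER)


def reap_comms(proc, timeout):
    """Wait up to `timeout` seconds for the helper, killing it if it hangs.

    Returns the helper's exit code, which is non-zero if it had to be killed.
    """
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()        # a hung helper cannot stall the client for long
        return proc.wait()


def crunch(units):
    """Stand-in for the number crunching done between check-ins."""
    return [u * u for u in units]
```

A client would call spawn_comms(), go on with crunch(), and call 
reap_comms() only at a convenient moment, so a dead server costs it nothing 
but a retry later.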

One thing which could be done - probably without a lot of effort - would be to 
have a number of "PrimeNet servers" acting as intermediaries between the user 
client and the main server. That way the main server would only have to 
interact with the sub-servers, so it could be effectively protected from 
hostile traffic by firewalling. Individual sub-servers might still crash (or 
be DoSed) but this wouldn't matter to anything like the same extent if the 
client were to try a different sub-server every time on a "round robin" 
basis.
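
On the client side, the round robin with failover could be as simple as the 
Python sketch below. The class name, the server names and the transport 
callable are assumptions for the example; transport stands for whatever 
actually sends a request to one sub-server and raises OSError when that 
sub-server cannot be reached.

```python
class RoundRobinClient:
    """Rotate through sub-servers, skipping any that fail."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one sub-server is required")
        self.servers = list(servers)
        self.next_index = 0   # where the next request will start

    def send(self, request, transport):
        """Try each sub-server at most once, starting after the last success.

        Returns (server, reply) from the first sub-server that answers.
        Raises ConnectionError if every sub-server fails.
        """
        errors = []
        count = len(self.servers)
        for offset in range(count):
            index = (self.next_index + offset) % count
            server = self.servers[index]
            try:
                reply = transport(server, request)
            except OSError as exc:
                errors.append((server, exc))   # remember why, try the next
                continue
            self.next_index = (index + 1) % count
            return server, reply
        raise ConnectionError(f"all {count} sub-servers failed: {errors}")
```

With three sub-servers of which the second is down, successive requests go 
to the first, the third, then the first again, so a single crashed 
sub-server costs each client at most one failed attempt per request.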

If the client to server transactions were encrypted, the critical bits 
needn't even be decrypted at the sub-servers, i.e. a rogue sub-server needn't 
endanger the project as a whole, and "private" data like new prime 
discoveries could be effectively hidden from the sub-server administrator. 
The sub-servers could then be hosted on a volunteer basis so that the project 
needn't shell out for more hardware to expand its throughput.

Just a few thoughts.

Anyone else any comments?

Please try to remember that I'm trying to be constructive - what can we do to 
make a great project better - not just carping on about deficiencies. 
Obviously the project has grown - as have the hazards of connecting systems 
to the network - what made perfect sense 10 years ago may no longer be wholly 
adequate for reasons which were not then foreseeable. Maybe the 10th birthday 
of GIMPS is a good time to re-evaluate the relationships between client, 
server and master database.

Regards
Brian Beesley
