On Tuesday 03 January 2006 02:22, George Woltman wrote:

> Restarting the server when it goes dead is a minimum amount of effort.

Maybe, but it's more effort than is desirable; also I get the
impression that monitoring the server to check whether or not it's
working is effort - maybe not much - which needs to be, but isn't
actually, applied continuously.

> My best guess is it seems to die for three reasons.
>
> 1) Bad data sent by client that isn't rigorously checked by server.
> Buffer overruns on user names or passwords or some other field are
> probably getting the primenet server application in a funny hung
> state.

In my experience _everything_ connected to the net gets bombed by
badly formatted service requests - even if the genuine client is
impeccable, there are still the requests created by crackers looking
for vulnerable systems. The situation is not likely to improve in the
short to medium term, if ever. It's _essential_ to keep software up to
date so that known vulnerabilities are not exposed.

> 2) The backend database goes down. An upgrade to SQLServer 2005 might
> help. If the primenet server application handled errors better, that
> would help too. This is the infamous ERROR 3 problem.

Sure. But I can't help but think that a crashing database is maybe a
sign of at least one of enemy action (hostile input) or software that
needs to be patched or upgraded. Also a little investment in automatic
service monitoring might at least enable a failed database service to
be resurrected in reasonably short order without manual intervention.

Enemy action can be countered by extra validation, or by changing the
client/server protocol to use some reasonably robust authentication
mechanism - e.g. message serial numbering & timestamping combined with
a cryptographically generated signature. I don't see why individual
clients shouldn't have to support this extra effort; everybody has a
web browser which does this sort of thing!

> 3) The manual web pages provide an opportunity to flood the server
> with arbitrary text results. Buffer overruns or mis-parsing this text
> might have led to some outages.

I thought we already had a pretty strict volume limiter - in the days
before the manual assignments page stopped working altogether?

> 4) This isn't really an outage, but at the top of each hour building
> the hourly reports takes about ten minutes. When the stats reports
> were first brought online, the server was managing far, far fewer
> exponents.

Hum. I used to run a database transaction logging service where the
transactions sometimes ran into hundreds of megabytes per hour. I think
this is a couple of decimal orders of magnitude heavier than PrimeNet?
I ran reports only daily, but there were a lot more of them than
PrimeNet generates (at least the ones published on the web pages). The
report generation did take a few hours per day but I only had a 1 GHz
PIII and a slowish disk subsystem to run it on.

I don't know what you are using but my main tools were bzip2 (without
which retained data would have cost far too much disk space, but which
does consume considerable CPU resources), bash, grep, sort and awk.
Just before I left that job I had to move this from the linux "PC"
(actually it was a rack mounted server) onto a Sun Ultra Sparc system.
The effort was minimal - not far from zero - because of the simple and
open structure.

Finally it occurs to me that one clear way of solving this problem
would be to run the report generation on a separate system - a basic
server with read-only access to the database, to which the server
permits local access only, would seem to be a pretty cheap way of
getting the CPU cycles needed to generate the reports without bogging
the server down. If we really need to supply extra hardware then I'm
sure US$1,000 would cover it.

> For the curious, a spec for the next client-server interface is at
> http://v5.mersenne.org/v5design/v5webAPI_0.96.html

Thanks.
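
To make the serial number, timestamp and signature suggestion from
earlier in this reply concrete, here is a minimal Python sketch. It is
an illustration under stated assumptions, not PrimeNet's protocol:
HMAC-SHA256 with a shared per-client key stands in for the
"cryptographically generated signature", and the names sign_message,
verify_message and MAX_CLOCK_SKEW are hypothetical.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical tolerance for clock differences between client and server.
MAX_CLOCK_SKEW = 300  # seconds


def sign_message(key: bytes, serial: int, payload: str,
                 now: Optional[float] = None) -> dict:
    """Wrap a payload with a serial number, a timestamp and an HMAC."""
    timestamp = int(time.time() if now is None else now)
    body = json.dumps(
        {"serial": serial, "timestamp": timestamp, "payload": payload},
        sort_keys=True,
    )
    signature = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_message(key: bytes, message: dict, last_serial: int,
                   now: Optional[float] = None) -> Optional[dict]:
    """Return the parsed body if authentic, fresh and not replayed, else None."""
    expected = hmac.new(
        key, message["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison; the body is only parsed after it passes.
    if not hmac.compare_digest(expected.encode("utf-8"),
                               message["signature"].encode("utf-8")):
        return None
    body = json.loads(message["body"])
    current = time.time() if now is None else now
    if abs(current - body["timestamp"]) > MAX_CLOCK_SKEW:
        return None  # too old (or too far in the future)
    if body["serial"] <= last_serial:
        return None  # serial already seen: treat as a replay
    return body
```

The server would keep the last accepted serial per client and discard
anything that fails any of the three checks before the request gets
near the parsing code that might be vulnerable to overruns.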

> > At that stage it just might be possible that someone would invest
> > time in rewriting the server code so that it can be implemented in
> > a distributed, hardware/OS independent method so that reliance on a
> > single box and effectively a single sysadmin can be removed from
> > this project.
>
> All efforts have been directed toward replacing the primenet server
> application with a new from scratch more robust and bulletproof
> application. Scott and I have been working on it for about 4 months,
> but at nowhere near the 40 hours a week required to make this happen
> quickly.

Need any help? I could probably chip in one day a week...

It seems to be a necessity to do something with the client to prevent
it from hanging. I'd suggest removing the PrimeNet comms from the
client altogether! Have a separate program which would either be forked
by the client when needed, and terminate itself when done, or run as a
separate background process to handle the comms. That way the client
could run uninterrupted (unless it happens to run out of work
altogether).

One thing which could be done - probably without a lot of effort -
would be to have a number of "PrimeNet servers" acting as
intermediaries between the user client and the main server. That way
the main server would only have to interact with the sub-servers, so it
could be effectively protected from hostile traffic by firewalling.
Individual sub-servers might still crash (or be DoSsed) but this
wouldn't matter to anything like the same extent if the client were to
try a different sub-server every time on a "round robin" basis. If the
client to server transactions were encrypted the critical bits needn't
even be decrypted at the sub-servers, i.e. a rogue sub-server needn't
endanger the project as a whole, and "private" data like new prime
discoveries could be effectively hidden from the sub-server
administrator.
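
A minimal Python sketch of the round-robin idea, for illustration only.
The class name RoundRobinClient, the send callback and the host names
in the usage note are all hypothetical; the real transport would be
whatever the client/server protocol uses, and failures are assumed to
surface as OSError (which covers connection and timeout errors in
Python).

```python
from typing import Callable, List, Optional


class RoundRobinClient:
    """Rotate through sub-servers, skipping any that fail to respond."""

    def __init__(self, servers: List[str],
                 send: Callable[[str, str], str]) -> None:
        if not servers:
            raise ValueError("at least one sub-server is required")
        self.servers = list(servers)
        self.send = send      # send(server, message) -> reply; raises OSError
        self.next_index = 0   # where the next submission starts

    def submit(self, message: str) -> Optional[str]:
        """Try each sub-server at most once; return the first reply or None."""
        for _ in range(len(self.servers)):
            server = self.servers[self.next_index]
            # Advance first, so the next call starts at a different server
            # whether this attempt succeeds or fails.
            self.next_index = (self.next_index + 1) % len(self.servers)
            try:
                return self.send(server, message)
            except OSError:
                continue  # crashed or DoSsed sub-server: try the next one
        return None  # every sub-server failed; the caller can retry later
```

With servers ["sub-a.example", "sub-b.example", "sub-c.example"] and
sub-b down, successive submissions go to sub-a, then (after sub-b
fails) sub-c, then sub-a again, so one dead sub-server costs a single
failed attempt rather than a hung client.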

The sub-servers could then be hosted on a volunteer basis so that the
project needn't shell out for more hardware to expand its throughput.

Just a few thoughts. Does anyone else have any comments? Please try to
remember that I'm trying to be constructive - what can we do to make a
great project better - not just carping on about deficiencies.
Obviously the project has grown - as have the hazards of connecting
systems to the network - and what made perfect sense 10 years ago may
no longer be wholly adequate for reasons which were not then
foreseeable. Maybe the 10th birthday of GIMPS is a good time to
re-evaluate the relationships between client, server and master
database.

Regards
Brian Beesley
_______________________________________________
Prime mailing list
http://hogranch.com/mailman/listinfo/prime
