Mersenne Digest Friday, January 16 2004 Volume 01 : Number 1104
----------------------------------------------------------------------

Date: Thu, 15 Jan 2004 02:57:06 +0100
From: "Steinar H. Gunderson" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: double-check mismatches

On Wed, Jan 14, 2004 at 05:00:18PM -0800, Max wrote:
> Are any statistics on double-check mismatches available?
> How often does this happen?

About 0.5%, IIRC.

/* Steinar */
- --
Homepage: http://www.sesse.net/
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 09:25:04 +0100
From: "Hoogendoorn, Sander" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: double-check mismatches

> Are any statistics on double-check mismatches available?
> How often does this happen?

See http://www.mersenneforum.org/showthread.php?s=&threadid=1116
The combined error rate is between 3 and 4%.

------------------------------

Date: Thu, 15 Jan 2004 14:41:49 -0500 (EST)
From: [EMAIL PROTECTED]
Subject: Mersenne: p95

To whom it may concern:
My question is: will the p95 software run on a system of clusters?

------------------------------

Date: Thu, 15 Jan 2004 11:00:17 -0600
From: "Ryan Malayter" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: p95

[EMAIL PROTECTED]
> To whom it may concern:
> My question is: will the p95 software run on a system of clusters?

If you're asking whether there's a cluster-aware version of Prime95, the answer is no.
Because of the nature of the error checking done on the server, there is no need to provide failover of the service. Nor is there a need to have two coordinated processes running on each node to increase performance - each node can run a copy of Prime95 independently, testing different exponents while maintaining the maximum performance possible.

You can use Prime95 on a cluster by running a separate copy on each node, and you'll get all the performance your hardware can provide. If your particular cluster architecture requires "mirror image" execution and file systems, then you'll have a problem, because both nodes will perform exactly the same Prime95 computations, and you'll get the same performance as a single machine.

Regards,
Ryan

------------------------------

Date: Thu, 15 Jan 2004 09:24:13 -0800
From: "John R Pierce" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: p95

>> To whom it may concern:
>> My question is: will the p95 software run on a system of clusters?
>
> If you're asking whether there's a cluster-aware version of Prime95, the
> answer is no. Because of the nature of the error checking done on the
> server, there is no need to provide failover of the service. Nor is
> there a need to have two coordinated processes running on each node to
> increase performance - each node can run a copy of Prime95 independently,
> testing different exponents while maintaining the maximum performance
> possible.
>
> You can use Prime95 on a cluster by running a separate copy on each
> node, and you'll get all the performance your hardware can provide.
> If your particular cluster architecture requires "mirror image" execution
> and file systems, then you'll have a problem, because both nodes will
> perform exactly the same Prime95 computations, and you'll get the same
> performance as a single machine.

I see a bit of confusion here... There are two distinctly different kinds of clustering in common use: High Performance and High Availability.

HA (High Availability) clusters are most frequently pairs of primary/standby computers, although sometimes they do some form of load balancing for specific sorts of applications (most typically web servers and database servers). This sort of architecture wouldn't be of any use with Mersenne prime testing: it's a 'self-healing' system, it's not mission critical, and it has its own consistency tests.

HP (High Performance) clusters, by contrast, are dozens, hundreds, or even thousands of nodes of identical "cheap" computers loosely clustered with a network, designed to run distributed computing applications. It would be quite simple to spawn a discrete copy of Prime95 (or the Unix/Linux mprime) on each node of one of these.

If you in fact had thousands of nodes, it might make sense to implement your own exponent allocation server ('primenet') so that all of those nodes aren't directly accessing your internet connection to fetch exponents and return results. On the other hand, each instance of prime95/mprime typically only "checks in" every few days, so this really wouldn't be that big of a deal.

------------------------------

Date: Thu, 15 Jan 2004 12:29:23 -0600
From: "Ryan Malayter" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: p95

[John R Pierce]
> I see a bit of confusion here... There are two distinctly different kinds
> of clustering in common use...
> High Performance and High Availability.

Umm... How is that different from what I wrote? I didn't use the acronyms, but so what? There is no need for either HA or HP clustering with Prime95, which I illustrated fairly clearly. *I'm* not confused about cluster architectures at all; I administer both HA and HP cluster architectures in my own company.

[John R Pierce]
> It would be quite simple to spawn a discrete copy of Prime95
> (or the Unix/Linux mprime) on each node of one of these.

This is *exactly* what I said, with only slightly different words. See below:

[Ryan Malayter]
> You can use Prime95 on a cluster by running a separate copy
> on each node, and you'll get all the performance your
> hardware can provide.

So I have to ask, John: other than illustrating that you've read a clustering whitepaper once in your life, what was the point of your message? Don't call me "confused" and then repeat my message in a nearly verbatim manner!

------------------------------

Date: Thu, 15 Jan 2004 19:15:46 +0000
From: "Brian J. Beesley" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: double-check mismatches

On Thursday 15 January 2004 01:00, Max wrote:
> Hello!
>
> Are any statistics on double-check mismatches available?
> How often does this happen?

~2% of all runs are bad.

> If my result mismatches someone else's, will I get any notice about
> that?

No. But you can check the database - any results in the file "bad" have been rejected because of a residual mismatch.

> Can I learn which of my results were confirmed by others?

Yes. Check the "lucas_v" database file.

> P.S. Having periodic problems with overheating (coolers become dusty)
> causing ``roundoff'' etc.
> hardware errors in mprime,

Can you not run a hardware monitoring program based on lm_sensors, so that an alarm sounds at a temperature below the one that causes problems? Most P4 chipsets will also automatically throttle the CPU clock if/when overheating occurs, so you will be notified by increasing iteration times rather than errors.

> I don't much believe in computational results unless they're confirmed
> by several parties.

This attitude is entirely reasonable for long runs given consumer-grade hardware.

> BTW, how error-proof is mprime?

On its own, not particularly. The computational cost of reasonably robust self-checking would be too much to bear. However, given that independent double checks are run, the _project system_ is pretty good - matching residuals mean that the chance of an error getting into the database as a result of a computational error is of the order of 1 in 10^20. _Detected_ errors - roundoff or otherwise - are not a problem. It's the undetected ones which are dangerous. If you have any ideas about how to improve this, I'm sure that George will consider them.

There _are_ significant weaknesses in the project - in particular there is a _possibility_ that forged double-check results could be submitted - that is one reason why I'm trying to triple-check all the exponents where both tests were run by the same user. Yes, I'm aware that a determined person with a working forging formula could bypass that check too, but we've got to start somewhere.

Regards
Brian Beesley

------------------------------

Date: Thu, 15 Jan 2004 11:26:25 -0800
From: "John R Pierce" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: p95

> [John R Pierce]
> > I see a bit of confusion here... There are two distinctly different kinds
> > of clustering in common use...
> > High Performance and High Availability.
>
> Umm... How is that different from what I wrote? I didn't use the
> acronyms, but so what? There is no need for either HA or HP clustering
> with Prime95, which I illustrated fairly clearly. *I'm* not confused
> about cluster architectures at all; I administer both HA and HP cluster
> architectures in my own company.
...
> So I have to ask, John: other than illustrating that you've read a
> clustering whitepaper once in your life, what was the point of your
> message? Don't call me "confused" and then repeat my message in a nearly
> verbatim manner!

I was attempting to clarify the clustering terminology for the general audience out here, not correcting you. I'm sorry I didn't word my prologue better; I should have said something more like "To clarify what Ryan said...". As the initial question was rather vague, I thought some further explanation might be appropriate.

I recently did some research into clustering technology for a project at my company, and at first found myself confused, coming at it from outside, until I got a grasp on the two distinctly different clustering techniques. (In my case, what we needed was the high-availability sort, but my first contact with clustering software was the wrong sort entirely, and as I didn't understand the distinction I wasted a few weeks investigating the high-performance stuff instead.) The problem was, in my case at least, that the people I first talked with were from the scientific market, and they just referred to them as 'clusters' without the 'high perf' or 'high availability' qualifier.

BTW, I've done a *bit* more than read clustering whitepapers: I've configured and built a couple of prototype clusters, one a small Linux "OSCAR" high-performance cluster, to see how that stuff all works with MPI and so forth, and then a Veritas cluster with a pair of Solaris systems, to evaluate how the high-availability failover stuff could work with respect to our application...
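[The per-node approach John and Ryan both describe can be sketched as a small script. This is only an illustration: the host names, the install directory, and the use of mprime's -d switch are assumptions to check against your own cluster and mprime version. The script only builds the ssh command lines; running them requires real nodes.]

```python
# Sketch: launch an independent mprime on each node of a small HP
# cluster. Each node gets its own working directory, so worktodo and
# save files never collide and every node tests different exponents.
NODES = ["node01", "node02", "node03", "node04"]  # hypothetical hosts

def launch_command(node, workdir="~/mprime"):
    """Build the ssh command that starts one detached mprime on `node`."""
    remote = f"cd {workdir} && nohup ./mprime -d >> mprime.log 2>&1 &"
    return ["ssh", node, remote]

commands = [launch_command(n) for n in NODES]
for cmd in commands:
    # In real use: subprocess.run(cmd); here we just show the commands.
    print(" ".join(cmd))
```

On a large cluster the same loop would simply iterate over the full node list; since each copy checks in with the server only every few days, no coordination between nodes is needed.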
This by no means makes me an expert, as these were simplistic first-pass evaluations rather than production experience, but it does give me some basic insight into how an outsider might misunderstand clustering technologies.

Sorry for the misunderstanding.

- -john r pierce

------------------------------

Date: Thu, 15 Jan 2004 14:37:42 -0600
From: "Ryan Malayter" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: p95

[John R Pierce]
> Sorry for the misunderstanding.

My apologies as well... Looking back, it appears I overreacted a bit, and your message didn't specifically label *me* "confused", but rather stated "I see a bit of confusion here".
Regards,
Ryan

------------------------------

Date: Thu, 15 Jan 2004 22:10:00 -0800
From: Max <[EMAIL PROTECTED]>
Subject: Mersenne: Re: double-check mismatches

On Thursday 15 January 2004 12:22, Brian J. Beesley wrote:
> On Thursday 15 January 2004 01:00, Max wrote:
>> Are any statistics on double-check mismatches available?
>> How often does this happen?
> ~2% of all runs are bad.

It would also be interesting to learn how often the first run is bad, and how often the second.

It seems to me that the first run should be bad more often than the second. Is that true? My reasoning is that the first run is usually done on modern (fast/overclocked/unstable/etc.) hardware, while the second is done on older/slower but more stable/trusted hardware. Please correct me if I'm wrong.

Thanks,
Max

------------------------------

Date: Thu, 15 Jan 2004 22:49:24 -0800
From: Kevin Sexton <[EMAIL PROTECTED]>
Subject: Re: Mersenne: Re: double-check mismatches

Max wrote:
> On Thursday 15 January 2004 12:22, Brian J. Beesley wrote:
>> ~2% of all runs are bad.
>
> It would also be interesting to learn how often the first run is bad,
> and how often the second.
>
> It seems to me that the first run should be bad more often than the
> second. Is that true?
> My reasoning is that the first run is usually done on modern
> (fast/overclocked/unstable/etc.) hardware while the second is done
> on older/slower but more stable/trusted hardware.
> Please correct me if I'm wrong.
>
> Thanks,
> Max

A reason for that to be reversed would be random "cosmic ray" errors. A faster computer allows less time per exponent for an error to occur. Say such an error occurs about once a year per computer (I know this is too often; it's just an example): a slow computer that finishes one exponent per year would average about one error per exponent, while one that completes ten in a year would average one error every ten exponents.

------------------------------

Date: Fri, 16 Jan 2004 07:54:20 +0000
From: "Brian J. Beesley" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: Re: double-check mismatches

On Friday 16 January 2004 06:10, Max wrote:
> It would also be interesting to learn how often the first run is bad, and
> how often the second.

Yes - I don't think this information is readily available, though sometimes you can infer the order of completion from the program version number.

To do the job properly, either the "bad" database would need an extra field (date of submission) or a complete set of "cleared.txt" files would be required - and this would miss any results submitted manually.

> It seems to me that the first run should be bad more often than the second.
> Is that true? My reasoning is that the first run is usually done on modern
> (fast/overclocked/unstable/etc.) hardware while the second is done on
> older/slower but more stable/trusted hardware.
Interesting theory - but surely the error rate would be expected to be proportional to the run length, which would tend to make fast hardware appear relatively more reliable. Conversely, smaller / lower-power components (required to achieve high speed) would be more subject to quantum-tunnelling errors. For those who think in terms of cosmic rays, this means a less energetic particle hit will be enough to flip the state of a bit.

In any case, the exponents ~10,000,000 which are being double-checked now were originally tested on "leading edge" hardware about 4 years ago, when overclocking was by no means unknown but was often done without the sort of sophisticated cooling which is readily available these days.

Regards
Brian Beesley

------------------------------

Date: Fri, 16 Jan 2004 01:18:15 -0800
From: Kevin Sexton <[EMAIL PROTECTED]>
Subject: Re: Mersenne: Re: double-check mismatches

As far as I know, overclocking is not done to the extent that it was in the past, when with proper cooling some processors would operate at a significant percentage increase in speed. New processors operate in the microwave range when you look at their frequencies. You can imagine weird things that may happen in a chip when you go from 3.2 GHz to 4.2: strange effects from the inductance and capacitance of the traces, a signal on one trace jumping to another as an RF signal. It just doesn't seem to me to be worth the effort to push a chip beyond its rated speed anymore.
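[Kevin's earlier cosmic-ray arithmetic, and Brian's point that error rate should be proportional to run length, can be put into a toy calculation. The one-error-per-machine-year rate is Kevin's deliberately pessimistic example, not a measured figure; the Poisson-arrival model is an assumption added here for illustration.]

```python
import math

# Toy model: transient ("cosmic ray") errors arrive at a fixed rate per
# machine-year, so the chance that a run is corrupted grows with how
# long the exponent ties up the machine - fast hardware is exposed for
# less wall-clock time per exponent.
ERRORS_PER_YEAR = 1.0  # Kevin's admittedly-too-high example rate

def p_run_corrupted(years_per_exponent, rate=ERRORS_PER_YEAR):
    """Probability at least one error hits a run (Poisson arrivals)."""
    return 1.0 - math.exp(-rate * years_per_exponent)

slow = p_run_corrupted(1.0)   # 1 exponent per year
fast = p_run_corrupted(0.1)   # 10 exponents per year
print(f"slow machine, per exponent: {slow:.2f}")  # ~0.63
print(f"fast machine, per exponent: {fast:.2f}")  # ~0.10
```

This is the sense in which a fast machine can look *more* reliable per result even at the same underlying error rate, exactly as Brian suggests.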
Also, the bottleneck is not so much the processor: memory, chipset, and graphics card have a large impact on how fast that game plays. It is now more work, for less benefit, to overclock.

Brian J. Beesley wrote:
> On Friday 16 January 2004 06:10, Max wrote:
>> It would also be interesting to learn how often the first run is bad, and
>> how often the second.
>
> Yes - I don't think this information is readily available, though sometimes
> you can infer the order of completion from the program version number.
>
> To do the job properly, either the "bad" database would need an extra field
> (date of submission) or a complete set of "cleared.txt" files would be
> required - and this would miss any results submitted manually.
>
>> It seems to me that the first run should be bad more often than the second.
>> Is that true? My reasoning is that the first run is usually done on modern
>> (fast/overclocked/unstable/etc.) hardware while the second is done on
>> older/slower but more stable/trusted hardware.
>
> Interesting theory - but surely the error rate would be expected to be
> proportional to the run length, which would tend to make fast hardware appear
> relatively more reliable. Conversely, smaller / lower-power components
> (required to achieve high speed) would be more subject to quantum-tunnelling
> errors. For those who think in terms of cosmic rays, this means a less
> energetic particle hit will be enough to flip the state of a bit.
>
> In any case, the exponents ~10,000,000 which are being double-checked now were
> originally tested on "leading edge" hardware about 4 years ago, when
> overclocking was by no means unknown but was often done without the sort of
> sophisticated cooling which is readily available these days.
> Regards
> Brian Beesley

------------------------------

Date: Sat, 17 Jan 2004 02:32:35 +0000
From: Daran <[EMAIL PROTECTED]>
Subject: Re: Mersenne: double-check mismatches

On Thu, Jan 15, 2004 at 07:15:46PM +0000, Brian J. Beesley wrote:
> ...matching
> residuals mean that the chance of an error getting into the database as a
> result of a computational error is of the order of 1 in 10^20.

That's per exponent, isn't it? The chance that one of the roughly quarter-million status-doublechecked exponents is in error is about five orders of magnitude higher. Still acceptable, or at least a minor concern in comparison to the other security issues.

> Regards
> Brian Beesley

Daran G.

------------------------------

End of Mersenne Digest V1 #1104
*******************************
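[Editorial postscript: Daran's "five orders of magnitude" estimate checks out numerically. The quarter-million figure and the 10^-20 per-exponent chance are taken from the thread itself; the log1p/expm1 route is needed only because 10^-20 is far below double-precision epsilon, so computing (1 - p)^N directly would round to 1.]

```python
import math

# Per-exponent chance of an undetected error surviving a matching
# double check, and the approximate count of double-checked exponents,
# both as quoted in the thread above.
p = 1e-20
N = 250_000

# Chance that at least one bad result is in the database:
# 1 - (1 - p)^N, computed stably for tiny p.
at_least_one = -math.expm1(N * math.log1p(-p))
print(f"database-wide risk: {at_least_one:.1e}")  # ~2.5e-15

# How far above the per-exponent figure that is:
print(f"orders of magnitude above p: {math.log10(at_least_one / p):.1f}")
```

For tiny p this is just N * p = 2.5e-15, about 10^5.4 times the per-exponent figure, matching Daran's "about five orders of magnitude higher" and still negligible.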