Mersenne Digest       Friday, January 16 2004       Volume 01 : Number 1104




----------------------------------------------------------------------

Date: Thu, 15 Jan 2004 02:57:06 +0100
From: "Steinar H. Gunderson" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: double-check mismatches

On Wed, Jan 14, 2004 at 05:00:18PM -0800, Max wrote:
> Are any statistics on double-check mismatches available?
> How often does this happen?

About 0.5%, IIRC.

/* Steinar */
-- 
Homepage: http://www.sesse.net/
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 09:25:04 +0100
From: "Hoogendoorn, Sander" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: double-check mismatches

> Are any statistics on double-check mismatches available?
> How often does this happen?

See http://www.mersenneforum.org/showthread.php?s=&threadid=1116

Combined error rate is between 3 and 4%
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 14:41:49 -0500 (EST)
From: [EMAIL PROTECTED]
Subject: Mersenne: p95

To whom it may concern:
My question is: will the p95 software run on a system of clusters?

_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 11:00:17 -0600
From: "Ryan Malayter" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: p95

[EMAIL PROTECTED]
> To whom it may concern:
> My question is: will the p95 software run on a system of clusters?

If you're asking if there's a cluster-aware version of Prime95, the
answer is no. Because of the nature of the error checking done on the
server, there is no need to provide failover of the service. Nor is
there a need to have two coordinated processes running on each node to
increase performance - each node can run a copy of P95 independently,
testing different exponents while maintaining the maximum performance
possible.

You can use prime95 on a cluster by running a separate copy on each
node, and you'll get all the performance your hardware can provide. If
your particular cluster architecture requires "mirror image" execution
and file-systems, then you'll have a problem, because both nodes will
perform exactly the same prime95 computations, and you'll get the same
performance as a single machine.
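
If your nodes are ordinary independent Linux boxes, a trivial launcher is all 
you need. For instance (a hypothetical sketch only, not part of Prime95: it 
assumes passwordless ssh to each node, mprime already installed in 
/opt/mprime everywhere, and node names listed in a hosts.txt; the -d flag is 
from memory, so check the mprime README):

    #!/usr/bin/env python
    # Hypothetical per-node launcher sketch. Assumptions: passwordless ssh,
    # mprime installed in /opt/mprime on every node, node names in hosts.txt.
    import subprocess

    with open("hosts.txt") as f:
        nodes = [line.strip() for line in f if line.strip()]

    for node in nodes:
        # Each node runs its own copy against its own local work files,
        # so the copies test different exponents independently.
        cmd = "cd /opt/mprime && nohup ./mprime -d >> mprime.log 2>&1 &"
        subprocess.Popen(["ssh", node, cmd])
        print("started mprime on", node)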

Regards,
        Ryan
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 09:24:13 -0800
From: "John R Pierce" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: p95

>> To whom it may concern:
>> My question is: will the p95 software run on a system of clusters?
>
> If you're asking if there's a cluster-aware version of Prime95, the
> answer is no. Because of the nature of the error checking done on the
> server, there is no need to provide failover of the service. Nor is
> there a need to have two coordinated processes running on each node to
> increase performance - each node can run a copy of P95 independently,
> testing different exponents while maintaining the maximum performance
> possible.
>
> You can use prime95 on a cluster by running a separate copy on each
> node, and you'll get all the performance your hardware can provide. If
> your particular cluster architecture requires "mirror image" execution
> and file-systems, then you'll have a problem, because both nodes will
> perform exactly the same prime95 computations, and you'll get the same
> performance as a single machine.


I see a bit of confusion here...   There are two distinctly different kinds 
of clustering in common use... High Performance, and High Availability.



HA (High Availability) clusters are most frequently pairs of primary/standby 
computers, although sometimes they do some form of load balancing for 
specific sorts of applications (web servers and database servers, most 
typically).   This sort of architecture wouldn't be of any use with the 
Mersenne prime search: it's a 'self-healing' system, it's not 
mission-critical, and it has its own consistency tests.



HP (High Performance) clusters, by contrast, are dozens, hundreds, or even 
thousands of nodes of identical "cheap" computers loosely clustered with a 
network, designed to run distributed computing applications.   It would be 
quite simple to spawn a discrete copy of Prime95 (or the unix/linux mprime) 
on each node of one of these.   If you in fact had thousands of nodes, it 
might make sense to implement your own exponent allocation server 
('primenet') so that the nodes aren't all directly accessing your internet 
connection to fetch exponents and return results; on the other hand, each 
instance of prime95/mprime typically only "checks in" every few days, so 
this really wouldn't be that big of a deal.
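
Purely as an illustration of the idea (this is NOT the real primenet 
protocol, and the exponent range below is made up - in practice you'd 
reserve real prime-exponent candidates from PrimeNet first), a toy local 
allocation server could be as small as:

    # Toy local exponent-allocation server - illustrative only. The real
    # PrimeNet protocol is different, and real candidates must be prime
    # exponents reserved from PrimeNet, not this made-up odd-number range.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    exponents = iter(range(20000001, 20100000, 2))  # hypothetical range

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                e = next(exponents)       # hand out the next candidate
                self.send_response(200)
                self.end_headers()
                self.wfile.write(str(e).encode())
            except StopIteration:
                self.send_response(503)   # range exhausted
                self.end_headers()

    HTTPServer(("", 8000), Handler).serve_forever()

Each node would then fetch its next assignment from this box instead of 
hitting your internet connection.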





_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 12:29:23 -0600
From: "Ryan Malayter" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: p95

[John R Pierce]
> I see a bit of confusion here...   There are two distinctly 
> different kinds 
> of clustering in common use... High Performance, and High 
> Availability.

Umm... How is that different from what I wrote? I didn't use the
acronyms, but so what? There is no need for either HA or HP clustering
with Prime95, which I illustrated fairly clearly. *I'm* not confused
about cluster architectures at all; I administer both HA and HP cluster
architectures in my own company. 

[John R Pierce] 
> It would be quite simple to spawn a discrete copy of Prime95 
> (or the unix/linux mprime) on each node of one of these.   

This is *exactly* what I said, with only slightly different words.
See below:

[Ryan Malayter]
> You can use prime95 on a cluster by running a separate copy 
> on each node, and you'll get all the performance your 
> hardware can provide.

So I have to ask, John: Other than illustrating that you've read a
clustering whitepaper once in your life, what was the point of your
message? Don't call me "confused" and then repeat my message in a nearly
verbatim manner!
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 19:15:46 +0000
From: "Brian J. Beesley" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: double-check mismatches

On Thursday 15 January 2004 01:00, Max wrote:
> Hello!
>
> Are any statistics on double-check mismatches available?
> How often does this happen?

~2% of all runs are bad.
>
> If my result get mismatch with some other's will I get any notice about
> that?

No. But you can check the database - any results in the file "bad" have been 
rejected because of a residue mismatch.

> Can I learn which of my results were confirmed by others?

Yes. Check the "lucas_v" database file.
>
> P.S. Having periodical problems with overheating (coolers become dusty)
> causing ``roundoff'' etc. hardware errors in mprime,

Can you not run a hardware monitor program based on lm_sensors so that an 
alarm sounds at a temperature below that which causes problems? Most P4 
chipsets will also automatically throttle the CPU clock if/when overheating 
occurs, so you will be notified by increasing iteration times rather than 
errors.
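
A sketch of what I mean (assuming a Linux box exposing lm_sensors data 
through the /sys/class/hwmon interface - sensor paths and the sensible 
threshold vary by motherboard and kernel version):

    # Minimal temperature alarm sketch. Assumes lm_sensors/hwmon on Linux;
    # paths and scaling vary by chipset and kernel version.
    import glob
    import time

    LIMIT_C = 65.0  # pick something below the point where errors start

    while True:
        for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
            with open(path) as f:
                temp_c = int(f.read()) / 1000.0  # hwmon reports millidegrees
            if temp_c > LIMIT_C:
                # Terminal bell as a crude audible alarm.
                print("\a*** %s reads %.1f C - check your fans ***"
                      % (path, temp_c))
        time.sleep(30)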

> I don't much believe in computational results unless they're confirmed
> by several parties.

This attitude is entirely reasonable for long runs given consumer-grade 
hardware.

> BTW, how error-proof is mprime ?

On its own, not particularly. The computational cost of reasonably robust 
self-checking would be too much to bear. However, given that independent 
double checks are run, the _project system_ is pretty good - matching 
residues mean that the chance of an error getting into the database as a 
result of a computational error is of the order of 1 in 10^20.
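
As a back-of-envelope check (my own rough assumptions: ~2% per-run error 
rate, independent errors, and a wrong 64-bit residue being effectively 
random):

    # Back-of-envelope only; assumptions as above, not official figures.
    p_bad = 0.02              # chance a single run is erroneous
    p_match = 2.0 ** -64      # chance two independent wrong residues agree
    print("%.1e" % (p_bad * p_bad * p_match))
    # ~2.2e-23; "1 in 10^20" is the more conservative round figure.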

_Detected_ errors - roundoff or otherwise - are not a problem. It's the 
undetected ones which are dangerous.

If you have any ideas about how to improve this, I'm sure that George will 
consider them.

There _are_ significant weaknesses in the project - in particular there is a 
_possibility_ that forged double check results could be submitted - that is 
one reason why I'm trying to triple-check all the exponents where both tests 
were run by the same user. Yes, I'm aware that a determined person with a 
working forging formula could bypass that check, too, but we've got to start 
somewhere.

Regards
Brian Beesley
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 11:26:25 -0800
From: "John R Pierce" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: p95

>  [John R Pierce]
>  > I see a bit of confusion here...   There are two distinctly
>  > different kinds
>  > of clustering in common use... High Performance, and High
>  > Availability.
>
>  Umm... How is that different from what I wrote? I didn't use the
>  acronyms, but so what? There is no need for either HA or HP clustering
>  with Prime95, which I illustrated fairly clearly. *I'm* not confused
>  about cluster architectures at all; I administer both HA and HP cluster
>  architectures in my own company.
...
>  So I have to ask, John: Other than illustrating that you've read a
>  clustering whitepaper once in your life, what was the point of your
>  message? Don't call me "confused" and then repeat my message in a nearly
>  verbatim manner!


I was attempting to clarify the clustering terminology for the general 
audience out here, not correcting you; I'm sorry I didn't word my prologue 
better. I should have stated something more like "To clarify what Ryan 
said...".    As the initial question was rather vague, I thought some 
further explanation might be appropriate.



I recently did some research into clustering technology for a project at my 
company, and at first, coming at it from outside, found myself confused 
until I got a grasp on the two distinctly different clustering techniques. 
(In my case, what we needed was the high availability sort, but my first 
contact with clustering software was the wrong sort entirely, and as I 
didn't understand the distinction I wasted a few weeks investigating the 
high performance stuff instead.)



The problem was, in my case at least, the people I first talked with were 
from the scientific market, and they just referred to them as 'clusters' 
without the 'high perf' or 'high availability' qualifier.



BTW, I've done a *bit* more than read clustering whitepapers: I've 
configured and built a couple of prototype clusters - one a small Linux 
"Oscar" high-perf cluster, to see how that stuff all works with MPI and so 
forth, then a Veritas cluster with a pair of Solaris systems, to evaluate 
how the high-availability failover stuff could work with respect to our 
application.   This by no means makes me an expert, as these were simplistic 
first-pass evaluations rather than production experience, but it does give 
me some basic insight into how an outsider might misunderstand clustering 
technologies.



Sorry for the misunderstanding.



-john r pierce



_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------


Date: Thu, 15 Jan 2004 14:37:42 -0600
From: "Ryan Malayter" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: p95

[John R Pierce]
> Sorry for the misunderstanding.

My apologies as well... Looking back, it appears I overreacted a bit,
and your message didn't specifically label *me* "confused", but rather
stated "I see a bit of confusion here".

Regards,
        Ryan
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 22:10:00 -0800
From: Max <[EMAIL PROTECTED]>
Subject: Mersenne: Re: double-check mismatches

On Thursday 15 January 2004 12:22, Brian J. Beesley wrote:

 >On Thursday 15 January 2004 01:00, Max wrote:

 >> Are any statistics on double-check mismatches available?
 >> How often does this happen?

 >~2% of all runs are bad.

It would also be interesting to learn how often the first run is bad, and how 
often the second.

It seems to me that the first run should be bad more often than the second. Is 
that true? My reasoning is that the first run is usually done on modern 
(fast/overclocked/unstable/etc) hardware while the second is done on old/slow 
but more stable/trusted hardware.

Please correct me if I'm wrong.

Thanks,
Max

_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 15 Jan 2004 22:49:24 -0800
From: Kevin Sexton <[EMAIL PROTECTED]>
Subject: Re: Mersenne: Re: double-check mismatches

Max wrote:

> On Thursday 15 January 2004 12:22, Brian J. Beesley wrote:
>
> >On Thursday 15 January 2004 01:00, Max wrote:
>
> >> Are any statistics on double-check mismatches available?
> >> How often does this happen?
>
> >~2% of all runs are bad.
>
> It would also be interesting to learn how often the first run is bad, 
> and how often the second.
>
> It seems to me that the first run should be bad more often than the 
> second. Is that true?
> My reasoning is that the first run is usually done on modern 
> (fast/overclocked/unstable/etc) hardware while the second is done 
> on old/slow but more stable/trusted hardware.
>
> Please correct me if I'm wrong.
>
> Thanks,
> Max
>
A reason for that to be reversed would be random "cosmic ray" errors. A 
faster computer allows less time per exponent for an error to occur. Say 
such an error occurs about once per year per computer (I know this is too 
often; it's just an example): a slow computer that finishes one exponent per 
year would average about one error per exponent, while one that completes 
ten in a year would average one error every ten exponents.
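
In numbers (same toy assumption - one random error per machine-year):

    # Toy model: a fixed background error rate per machine-year means
    # slower machines accumulate more errors per exponent.
    errors_per_year = 1.0  # deliberately too high; illustration only
    for exponents_per_year in (1, 10):
        print("%2d exponents/year -> %.1f errors per exponent"
              % (exponents_per_year, errors_per_year / exponents_per_year))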


_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Fri, 16 Jan 2004 07:54:20 +0000
From: "Brian J. Beesley" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: Re: double-check mismatches

On Friday 16 January 2004 06:10, Max wrote:
>
> It would also be interesting to learn how often the first run is bad, and
> how often the second.

Yes - I don't think this information is readily available, though sometimes 
you can infer the order of completion from the program version number.

To do the job properly either the "bad" database would need an extra field 
(date of submission) or a complete set of "cleared.txt" files would be 
required - and this would miss any results submitted manually.
>
> It seems to me that the first run should be bad more often than the second. Is
> that true? My reasoning is that the first run is usually done on modern
> (fast/overclocked/unstable/etc) hardware while the second is done on
> old/slow but more stable/trusted hardware.

Interesting theory - but surely the error rate would be expected to be 
proportional to the run length, which would tend to make fast hardware appear 
to be relatively more reliable - conversely smaller / lower power components 
(required to achieve high speed) would be more subject to quantum tunnelling 
errors. For those who think in terms of cosmic rays, this means a less 
energetic particle hit will be enough to flip the state of a bit.

In any case the exponents ~10,000,000 which are being double checked now were 
originally tested on "leading edge" hardware about 4 years ago, when 
overclocking was by no means unknown but was often done without the sort of 
sophisticated cooling which is readily available these days.

Regards
Brian Beesley
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------


Date: Fri, 16 Jan 2004 01:18:15 -0800
From: Kevin Sexton <[EMAIL PROTECTED]>
Subject: Re: Mersenne: Re: double-check mismatches

As far as I know, overclocking is not done to the extent that it was in 
the past, when with proper cooling some processors would operate at a 
significant percentage increase in speed. New processors operate in the 
microwave range when you look at their frequencies. You can imagine the 
weird things that may happen in a chip when you go from 3.2 GHz to 4.2: 
strange things with the induction and capacitance of the traces, a signal on 
one trace jumping to another as an RF signal. It just doesn't seem to me 
to be worth the effort to push a chip beyond its rated speed anymore. 
Also, the bottleneck is not so much the processor: memory, chipset 
and graphics card have a large impact on how fast that game plays. It is 
now more work, for less benefit, to overclock.

Brian J. Beesley wrote:

>On Friday 16 January 2004 06:10, Max wrote:
>
>>It would also be interesting to learn how often the first run is bad, and
>>how often the second.
>
>Yes - I don't think this information is readily available, though sometimes 
>you can infer the order of completion from the program version number.
>
>To do the job properly either the "bad" database would need an extra field 
>(date of submission) or a complete set of "cleared.txt" files would be 
>required - and this would miss any results submitted manually.
>
>>It seems to me that the first run should be bad more often than the second. Is
>>that true? My reasoning is that the first run is usually done on modern
>>(fast/overclocked/unstable/etc) hardware while the second is done on
>>old/slow but more stable/trusted hardware.
>
>Interesting theory - but surely the error rate would be expected to be 
>proportional to the run length, which would tend to make fast hardware appear 
>to be relatively more reliable - conversely smaller / lower power components 
>(required to achieve high speed) would be more subject to quantum tunnelling 
>errors. For those who think in terms of cosmic rays, this means a less 
>energetic particle hit will be enough to flip the state of a bit.
>
>In any case the exponents ~10,000,000 which are being double checked now were 
>originally tested on "leading edge" hardware about 4 years ago, when 
>overclocking was by no means unknown but was often done without the sort of 
>sophisticated cooling which is readily available these days.
>
>Regards
>Brian Beesley

_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Sat, 17 Jan 2004 02:32:35 +0000
From: Daran <[EMAIL PROTECTED]>
Subject: Re: Mersenne: double-check mismatches

On Thu, Jan 15, 2004 at 07:15:46PM +0000, Brian J. Beesley wrote:

> ...matching 
> residues mean that the chance of an error getting into the database as a 
> result of a computational error is of the order of 1 in 10^20.

That's per exponent, isn't it?  The chance that one of the roughly quarter-
million double-checked exponents is in error is about five orders
of magnitude higher.
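
Worked through with those assumed figures:

    # Assumed figures: 1e-20 per exponent (Brian's estimate), and roughly
    # 250,000 double-checked exponents in the database.
    p_per_exponent = 1e-20
    n_exponents = 250000
    print("%.1e" % (p_per_exponent * n_exponents))  # ~2.5e-15: 5 orders up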

Still acceptable, or at least a minor concern in comparison to the other
security issues.

> Regards
> Brian Beesley

Daran G.
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

End of Mersenne Digest V1 #1104
*******************************
