It seems like there may be three (or more) kinds of problem:

- round-off error gets too large and creates an error (and other unknown
software issues)
- specific hardware failures producing errors
- random occurrences producing errors

mprime95 and the numerous ports and alternatives in GIMPS (I am not trying
to be callous towards Mac or Unix/Linux users -- I am too ignorant to be
callous) seem well able to catch several intermittent bugs in software.  I
know that I am frequently (well, several times a month on one machine or
another) hit with SUMOUT errors.  mprime95 just picks up at an earlier point
and restarts.  Occasionally the SUMOUT error repeats, but it is rarely at
the same iteration.  Something goes wrong there, but I cannot tell precisely
what.  Often the problem seems to crop up when some other piece of software
is misbehaving -- perhaps there is some memory violation that Windows does
not trap (ah, gee, ya think?).

It seems likely that each of these events is pretty much independent.  Both
expected specific hardware failures and random occurrences seem proportional
to running time.  Each machine and environment would have its own
probabilities (hard to know in advance -- probably hard to know at all).
The possible software glitches seem to be proportional to iterations, and as
several comments observe, the machine- (or processor-) specific failure
rates may increase with CPU speed.  Still, I thought I saw that the mprime95
algorithm for M(p) runs in something on the order of p^2 log(p) steps.
When p doubles, the LL runtime should go up by a factor slightly higher than
4 (asymptotically 4, and less than 4.2 at p > 8 million).
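A quick back-of-the-envelope check of that scaling (a sketch under an
assumed cost model -- about p - 2 squarings at roughly p*log2(p) work each;
the function name is mine):

```python
import math

def ll_work(p):
    """Rough LL cost model (assumed): about p - 2 squarings, each
    costing roughly p * log2(p) work via FFT multiplication."""
    return (p - 2) * p * math.log2(p)

# How much does the work grow when the exponent doubles, near p = 8 million?
p = 8_000_000
ratio = ll_work(2 * p) / ll_work(p)
print(f"doubling p near {p:,}: work grows by a factor of {ratio:.3f}")
```

The printed factor comes out a bit over 4 and under 4.2, matching the
asymptotic claim above.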

Stirring this melange together, if we quadruple p (to get from a lot of
current testing up to 33 million) and double processor/RAM speed (well, my
fastest machine is a P-II 266 MHz with a 66 MHz bus -- moving to 550/100
might double the throughput with available parts), the runtime would
increase by about a factor of 8.  I would naively expect the failure rate to
increase by a factor between 8 and 16, although it could be higher or lower
because those pesky probabilities all change.
Also, if the probability of a failure in any step is r and there are n
steps, then the probability of a clean run is (1-r)^n, so the probability of
at least one failure is 1-(1-r)^n, which has lead term nr (and is less than
nr, of course).  If we expect a 1% failure rate at 8M and r stays fixed,
then we expect about a 4% failure rate at around 33M (0.99^4 is about
0.96 = 1 - 0.04).
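That arithmetic is easy to sanity-check (a minimal sketch; the function name
is mine, and the 1% baseline is the assumed rate from above):

```python
def failure_prob(base_clean, scale):
    """P(at least one error) when the iteration count grows by `scale`
    and the per-step error rate r stays fixed.  `base_clean` is the
    baseline clean-run probability, i.e. (1 - r)^n for the shorter run."""
    return 1.0 - base_clean ** scale

# Assumed 1% failure rate near p = 8M; quadrupling p quadruples n.
print(f"{failure_prob(0.99, 4):.4f}")  # 0.0394 -- about a 4% failure rate
```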

I like the thread of saving multiple residues at various checkpoints along
the way.  George suggested a series based on % completion.  I might suggest
a specific series of points -- like every L(1000k).  This might be simpler
to track in a database, although the number of entries grows linearly with
p, so the data storage might grow with p^2, depending.  Another series like
k*floor(p/s) would work just as well and keep the data needs smaller, as it
would have just s+1 checkpoints (s can be fixed for all p).  Residues at all
of these checkpoints should be saved for both the 1st and 2nd run, as George
suggested.  There is no point in stopping a 2nd run at the first difference,
although there may be great value in starting a 3rd run as soon as possible
after the 2nd fails to match the 1st.  If the 3rd run pops up different from
both the 1st and 2nd runs, PrimeNet should send someone a cry for help: too
many mismatches suggest something strangely wrong.
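The two series can be compared directly (a sketch; both function names are
mine, and s = 10 is chosen just for illustration -- counting the trivial
k = 0 point as well gives the s+1 in the text):

```python
def every_n_checkpoints(p, interval=1_000_000):
    """Checkpoint at every multiple of `interval` below p:
    the number of entries grows linearly with p."""
    return list(range(interval, p, interval))

def fixed_count_checkpoints(p, s=10):
    """Checkpoint at k * floor(p/s) for k = 1..s:
    always s checkpoints, no matter how large p gets."""
    step = p // s
    return [k * step for k in range(1, s + 1)]

p = 33_000_000
print(len(every_n_checkpoints(p)))      # 32 entries -- grows with p
print(len(fixed_count_checkpoints(p)))  # 10 entries -- independent of p
```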

Might the v.17 problem have been trapped with something like this?  I do not
recall enough of the discussion to know and the ensuing belly-aching
overshadowed the real content of finding/fixing/reworking.  (I know I am
never going to rise high on the list, so I do not worry a whole lot about
how much my report shows.)

One way of testing a new version would be to double-check current- and
prior-version data.  In fact, I would expect that the quality assurance
group plans to use double-checking as a post-beta test stage.  The database
saves could let a lot of us help out on that last stage before a full
release.  I know I would be happy to let my double-checking machines do
new-version testing.

Joth



----- Original Message -----
From: Aaron Blosser <[EMAIL PROTECTED]>
To: Mersenne@Base.Com <[EMAIL PROTECTED]>
Sent: Wednesday, August 04, 1999 6:18 AM
Subject: RE: Mersenne: Multiple residues - enhancing double-checking


> > This had been discussed earlier.  Brian and I talked about it for a
> > little while, he came up with the original idea.
>
> Doh!  Curse my memory! :-)
>
> > > I think the idea has definite merit.  If an error does occur, it's
> > > equally likely to happen at any step along the way, statistically.
> > > Errors are every bit as likely to happen on the very first iteration
> > > as they are during the 50% mark, or the 32.6% mark, or on the very
> > > last iteration.
> >
> > True, but if the system is malfunctioning then the errors should start
> > early.
>
> Even more reason why it makes sense.
>
> > > Just for example, every 10% along the way, it'll send its current
> > > residue to the Primenet server.
> >
> > I'm guessing that you mean a certain amount of the residue.  Sending in
> > 10 2meg files for *each* exponent in the 20,000,000 range would get very
> > unwieldy, and inconvenient for people and primenet.
>
> Just a partial residue, like the one sent at the end of the test.  Even
> smaller ones, like a 32-bit instead of a 64-bit residue, seem like they
> would do the job splendidly.
>
> > > I forget the numbers being tossed around,
> > > but you'd only save 50% of (the error rate) of the
> > > checking time.
> >
> > As I pointed out above, the error rate should increase with the square
> > of the exponent (plus change).  This means that if 1% have errors at
> > 7mil, 22% will have errors at 30mil.
>
> Frightening to think so.  Are you sure the error rate increases?  Errors
> seem like they'd show up more as a result of faulty hardware, to my
> thinking.  I'd imagine that if a certain machine ran through about ten 10M
> exponents error-free, it has a very high likelihood of running a single
> 20M exponent error-free.
>
> _________________________________________________________________
> Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
> Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
>

