Mersenne Digest      Wednesday, November 10 1999      Volume 01 : Number 658




----------------------------------------------------------------------

Date: Mon, 8 Nov 1999 16:43:24 +0100
From: "Steinar H. Gunderson" <[EMAIL PROTECTED]>
Subject: Mersenne: Re: Version 19 for FreeBSD?

On Mon, Nov 08, 1999 at 01:28:05AM -0500, Bryan Fullerton wrote:
>Well, yes, that's possible - or we can just run the v18 client for FreeBSD.

But v19 is (a bit) faster, and has some extra features as well, no?

>Given that there already is a v18 for FreeBSD, I'd assume that there's
>someone who's helping George with that.  Is that a correct assumption?

Probably. But porters don't always have time to keep up with version
upgrades. Use the source, Luke (eesh).

/* Steinar */
- -- 
Homepage: http://members.xoom.com/sneeze/
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Mon, 8 Nov 1999 18:09:52 EST
From: [EMAIL PROTECTED]
Subject: Mersenne: Re: Mlucas on SPARC

Bill Rea <[EMAIL PROTECTED]> wrote (re. Mlucas and
MacLucasUNIX timings on a 400MHz SPARC E450 with 4MB L2 cache):

>>At 256K FFT I was seeing 0.11 secs/iter with
>>Mlucas against 0.15 secs/iter for MLU, at 512K FFT the figures were
>>0.25 secs/iter and 0.29 secs/iter respectively.

...and later added:

>Observing the running process with top, it looks to me as if
>the 256K  Mlucas fits completely in the L2 cache, whereas MLU
>takes about 12Mb. At 512K FFT around 60% of the running process
>will fit in the L2 cache. It's a cache size issue here.

A good rule of thumb is that Mlucas needs about (FFT length) x 8 bytes
plus another 10% or so of memory for the data, plus some more for the
program itself. Thus a 256K FFT should need a bit more than 2MB for
data, and perhaps a few hundred KB more for code, i.e. the whole thing
should fit into a 4MB L2 cache with room to spare. At 512K FFT length,
data+code probably need 5-6MB total.

>On my Ultra-5 with a small 256Kb L2 cache I get 0.58 secs/iter
>for MLU against 0.78 secs/iter for Mlucas at 512K FFT.

These timings suggest that it's more than a cache size issue - after
all, Mlucas has a smaller memory footprint irrespective of the CPU,
and one would expect some benefit from this even in cases where
both codes significantly exceed the L2 cache size. I wonder whether
the code itself (bloated by all the routines needed for
non-power-of-2 FFT lengths, which MLU doesn't support) might be
causing a slowdown here, by competing for space in the L2 cache
with FFT data? I've little experience with this aspect of performance,
but perhaps conditional compilation, with each binary incorporating
only the routines it needs for that length, could reduce the code
footprint and help performance - would any of the computer science
experts care to comment?

Or perhaps the compiler and OS already do a good job at keeping only
needed program segments in cache, in which case the problem lies yet
elsewhere.

One other possibility is that the SPARC f90 compiler, being *much*
younger than the (apparently very good) C compiler, has better
support for the v8 and v9 instruction sets, i.e. they didn't put
as much work into including optimizations for older CPUs - I don't
know.

We also could use help from any SPARC employees familiar with the
compiler technology to tell us why the f90 -xprefetch option is
so unpredictable - it speeds some runlengths by up to 10%, but more
often causes a 10-20% slowdown (or no change).

Bill also wrote:

>>>Mlucas runs significantly faster if you can compile and run it
>>>on a 64-bit Solaris 7 system.

I replied:

>>This is the first I've heard of this - roughly how much of a speedup
>>do you see?

Bill again:

>I'm a bit red-faced on this one. I just tried it again and it doesn't.
>This is still a mystery to me. It would seem to me that for
>this type of code, having full access to the 64-bit instruction
>set of the UltraSPARC CPUs and running on a 64-bit operating system
>would give the best performance. But that doesn't seem to be the
>case. 

Well, I didn't really expect it to make much difference, which is why
I expressed surprise when you said that it did. Was the slowdown you
mentioned for MLU in 64-bit mode also spurious?

Floating-point dominated code like Mlucas shouldn't really benefit
much from the 64-bit mode, except perhaps compared to that generated
by the crappy f90 v1 compiler, which wouldn't do 64-bit load/stores
even if one specified such in the compile flags.

In any event, Mlucas appears to perform very well on the newer SPARC
CPUs and decently well on the old ones (but there one may want to use
MLU at power-of-2 runlengths if it proves faster), so perhaps we
shouldn't get too greedy...I'm joking, of course - speed, give us
more speed!

- -Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Mon, 8 Nov 1999 18:09:53 EST
From: [EMAIL PROTECTED]
Subject: Mersenne: Mlucas on multiple CPUs

One reminder about using Mlucas on multiple machines - we don't have
an automated interface yet (and some users appear to actually enjoy
sending in their latest results manually - with new exponents, it's
not like they're having to check their output files daily), but I've
tried to make it reasonably easy to manage assignments. Just paste
2 or 3 exponents (perhaps more, if doing double-checking on a fast
machine) into the worktodo.ini file, then, when the code is down
to only a week or so of work (i.e. there are only 1-2 p's left in
the .ini file), just append more exponents to the end of the file --
you can do this without stopping the program.

For those of you used to the Prime95 worktodo.ini format, remember
that Mlucas needs EXPONENTS ONLY in this file, nothing else. (We'll
only need the full format when we integrate the factoring modules.)
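
Concretely, topping up the file looks like this (the exponent values
below are made up for illustration; use whatever assignments you have):

```shell
# seed the file with a few exponents -- one per line, nothing else
printf '9008231\n9010033\n' > worktodo.ini

# later, when only 1-2 exponents remain, append more work --
# the program can keep running while you do this
printf '9011087\n' >> worktodo.ini

cat worktodo.ini
```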

Also, remember to strip off the leading space in the Mxxyyzz Res64:
output line before sending any results to PrimeNet. I'll fix this
problem in the next release; for now you'll need to do it manually.
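
A one-liner can do the stripping until then (the sample results line
below is invented; point it at wherever your output actually lands):

```shell
# sample output line with the spurious leading blank
printf ' M9008231 Res64: 9304F573B0401CDE\n' > results.txt

# remove the leading space from " Mxxyyzz Res64: ..." lines
sed 's/^ M/M/' results.txt
```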

Happy hunting,
- -Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 09 Nov 1999 04:20:04 +0100
From: Sturle Sunde <[EMAIL PROTECTED]>
Subject: Re: Mersenne: Re: Mlucas on SPARC 

> perhaps conditional compilation, with each binary incorporating
> only the routines it needs for that length, could reduce the code
> footprint and help performance

It can help if it makes it possible for the compiler to eliminate jumps 
and branches that way.

> Or perhaps the compiler and OS already do a good job at keeping only
> needed program segments in cache, in which case the problem lies yet
> elsewhere.

Usually.

> One other possibility is that the SPARC f90 compiler, being *much*
> younger than the (apparently very good) C compiler, has better
> support for the v8 and v9 instruction sets, i.e. they didn't put
> as much work into including optimizations for older CPUs - I don't
> know.

A compiler is normally built of several separate parts: 

The frontend:
 o Lexical analyzer, syntax analyzer and semantic analyzer
   - Usually tightly coupled and always language specific.  This is 
     the frontend, or the parser.
 o Intermediate code generator
   - Takes a syntax tree from the parser and produces intermediate code 
     in a simple format -- simple to generate and simple to make 
     machine code from.  Some optimizations are practical to do here 
     (some things are easier to detect from the syntax tree than from 
     the intermediate code) but most of it is left for the next step.
The backend:
 o Code optimizer
   - Optimizes the intermediate code in several passes.
 o Code generator
   - Outputs the actual code for the specific platform.

As you can see, it is by design very easy to use the same backend for 
different languages and vice versa, so I doubt that Sun is developing 
separate backends for different languages.  They only need to adjust 
the frontend to create working intermediate code from a Fortran 
program, and leave the rest to their already working backend.  To 
support another CPU they only need to change the backend to optimize 
and create code for that CPU.  A lot of the optimizing is also CPU 
independent.

For a quick but closer look at how a compiler works and how optimizations 
are done, see chapter 14 of "Using and Porting GNU CC" by Richard Stallman. 

> We also could use help from any SPARC employees familiar with the
> compiler technology to tell us why the f90 -xprefetch option is
> so unpredictable - it speeds some runlengths by up to 10%, but more
> often causes a 10-20% slowdown (or no change).

I'm not a SPARC employee, but I have some credits in graduate level 
compiler technology. 8-)

Prediction is often very hard to do in a compiler.  Perfect register 
allocation is practically impossible.  You can try, and in most cases 
get a very good result, but not a perfect one.  If you try too hard and 
guess wrong in a few places, you may well end up doing the opposite of 
optimization.  

To the compiler it might look like you need a chunk of memory for only 
a short time, so it is "smart" and prefetches everything well ahead.  
It might even allocate registers for the data, while the CPU gets time to 
predict the next branch.  Very smart when it works.  In reality, however, 
contrary to what the compiler predicted, you might loop around 1000 more 
times before you take the other branch.  Because the compiler guessed 
wrong, you have lost some registers, and data you still needed has been 
evicted from the cache to make room for data you won't need until you've 
looped 1000 more times.  

The only secure way to get this right is to use good profiling feedback, 
which can record how often each variable and constant is 
needed, find patterns in how the program accesses memory, where 
branches tend to go, etc., and then use this information to optimize 
register allocation, prefetching, etc. in a second pass through the 
compiler.

To people especially interested in performance on Sun, I can recommend 
chapter 2 of the whitepaper "Delivering Performance on Sun: Optimizing 
Applications for Solaris", which is available here: 
  PostScript: http://www.sun.com/workshop/whitepapers/sol-app-perf.ps
  Acrobat:    http://www.sun.com/workshop/whitepapers/sol-app-perf.pdf

You should also install the latest patches from Sun.  There are a lot for 
f90: <URL: http://access1.sun.com/workshop/current-patches.html>


- -- 
Sturle                           URL: http://www.stud.ifi.uio.no/~sturles/
~~~~~~                               This will end your Windows session.
                                      Do you want to play another game?

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 9 Nov 1999 00:08:30 -0500 (EST)
From: jim barchuk <[EMAIL PROTECTED]>
Subject: Mersenne: 19.0.2 and ERROR 2250: Server unavailable

Hello All!

Yes, I see the comments about this in the archives, but there's no fix
suggested.

Been running mprime like the Energizer Bunny: 16.3, 17.1, 18, and now
19.0.2. Using RedHat 4.2 with lots of upgrades, kernel 2.0.33.

I haven't been reading this list lately. I did receive email on Oct 10
that V19 was available, and it is 'faster,' so naturally I picked it up. It
dropped in and appeared to work fine. BTW I use the static version because
dynamic programs rarely like RH 4.2.

I look at mprime.log very infrequently. I do look at top usually a few
times a day, and mprime is always at the top of the list. Today I glance
and it's nowhere to be seen. ps -a says it's running, but using
essentially no resources.

So I look at the log and find 287 entries of ERROR 2250: Server
unavailable dating back to the -day- I started V19 running. It's been
- -running- since then, but with -no- obvious notice of this problem.

Yes, I tried the dynamic version, it fails with error in loading shared
libraries : undefined symbol: __libc_start_main.

Any clues?

BTW I -strongly- suggest that when something like this happens,
another 'whoopsie, potentially serious error' message be broadcast to all
participants, not just mentioned on the list. As it stands my best (in
fact only, AFAIK) option is to fall back to V18 and lose a month's worth of
CPU time.

Thanks much. Have a :) day!

jb

- -- 
jim barchuk
[EMAIL PROTECTED]

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 9 Nov 1999 14:42:52 +0000
From: "Steinar H. Gunderson" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: 19.0.2 and ERROR 2250: Server unavailable

On Tue, Nov 09, 1999 at 12:08:30AM -0500, jim barchuk wrote:
>I haven't been reading this list lately. I did receive email on Oct 10
>that V19 was avail, and it is 'faster,' so natually I picked it up. It
>dropped in and appeared to work fine. BTW I use the static version because
>dynamic programs rarely like RH 4.2.

The problem seems to be that mprime is compiled to glibc 2.1, while RH 4.2
uses... libc5? glibc 2.0? glibc 2.1 (even in static version) needs some
special files to operate. This has been discussed on the list before, and
perhaps it's time to fix it :-)

I could perhaps download and compile glibc 2.1, and compile it with static NSS.
Then I could compile mprime (perhaps I'll need George's security.c and
security.h, but that's another matter entirely -- we'll take that when it
comes) with a version that would work for non-2.1 users. Any protests?

>Any clues?

For now, you could revert to v18, use v18 only to report your results, report
them via the manual forms, or just wait for us to do something about it :-)

/* Steinar */
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 9 Nov 1999 13:05:43 -0500
From: "Geoffrey Faivre-Malloy" <[EMAIL PROTECTED]>
Subject: Mersenne: Testers needed

After much anticipation and delay (well, maybe not anticipated because I
didn't tell many people about it), Prime95Setup is available.  If you'd like
to take a few minutes and help test it and make comments/suggestions, I'd
appreciate it.  You can download it (for now) at:

http://www.mindspring.com/~gjf/Prime95Setup.EXE

It's about 650kb in size.  Yes, I know it's larger, but I hope this will
help us get more coverage, as it will make installation easier.

Standard disclaimer applies - ya know the one that says I'm not responsible
if it formats your hard drive, sends all your private data to Microsoft,
etc. :)  I've tested it out and I don't think it's going to do that but ya
never know :)

Anyway, let me know if you find any bugs.  Flames should be sent to
/dev/null

G-Man

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 9 Nov 1999 18:43:14 -0000
From: "Brian J. Beesley" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: 19.0.2 and ERROR 2250: Server unavailable

On 9 Nov 99, at 14:42, Steinar H. Gunderson wrote:

> The problem seems to be that mprime is compiled to glibc 2.1, while RH 4.2
> uses... libc5? glibc 2.0? glibc 2.1 (even in static version) needs some
> special files to operate. This has been discussed on the list before, and
> perhaps it's time to fix it :-)

I tend to agree... Actually someone (George?) put up mprime & sprime 
v19.1 over the weekend, which (according to the whatsnew file) has a 
workaround coded in - though I'm unable to check whether it works, 
since I've not had problems with v19.0 on any of my systems.

> I could perhaps download and compile glibc 2.1, and compile it with static NSS.
> Then I could compile mprime (perhaps I'll need George's security.c and
> security.h, but that's another matter entirely -- we'll take that when it
> comes) with a version that would work for non-2.1 users. Any protests?

Would it be reasonable to suggest instead that someone build sprime 
v19.x on a system running RH 4.x, or something else old enough that 
the dynamic library problem hadn't been invented? Or does that 
replace one problem with another, viz. a binary that won't run on 
current releases?

I don't think there is any other significant issue ... ???
> 
> For now, you could revert to v18, use v18 only to report your results, report
> them via the manual forms, or just wait for us to do something about it :-)

Reverting to v18 - you'll lose any work in progress & have to restart 
it from scratch 8-(
Using v18 to report results - there would seem to be scope for 
confusion here - messy! 8-(
Reporting via manual forms - sounds OK for a few systems that are 
relatively easily "got at", but unmanageable for large numbers of 
systems - especially if they don't have a web browser available. Also 
will cause loss of PrimeNet credit 8-(
Just wait - I think we could & should get something done about this 
PDQ.

One possible option would be to archive the v19 save file(s), stop 
v19, remove the current assignment from worktodo.ini (by deleting the 
top line in the file) and restart using v18 for the time being. When 
a fixed v19 becomes available, replace the save file(s) & the 
worktodo.ini entry for the assignment. This will preserve work in 
progress, though the credit will obviously be deferred.

Also the option of upgrading to a new version of linux should be 
considered. The cost is not great (get the CheapBytes CD for less 
than $5), though there may be an issue with disk space on some 
systems - linux bloats too, though by no means as badly as windoze! 
The point here is that there are a number of security problems 
(especially in sendmail and the ftp daemon) which need attention; 
upgrading to a recent release is a reasonably convenient way of 
fixing these. Though note that RH 6.0 _still_ needs wu-ftpd upgraded; 
everything prior to v2.6.0 is vulnerable, and hackers are actively 
scanning IP address space looking for systems running wu-ftpd v2.5.x.

Regards
Brian Beesley
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 9 Nov 1999 18:43:14 -0000
From: "Brian J. Beesley" <[EMAIL PROTECTED]>
Subject: Re: Mersenne: Re: Mlucas on SPARC 

On 9 Nov 99, at 4:20, Sturle Sunde wrote:

> > perhaps conditional compilation, with each binary incorporating
> > only the routines it needs for that length, could reduce the code
> > footprint and help performance
> 
> It can help if it makes it possible for the compiler to eliminate jumps 
> and branches that way.

Or optimizes jumps & branches. There's a significant performance 
penalty associated with jumping to an instruction which is badly 
aligned with the way in which blocks of code are prefetched for 
decoding.

If unexecuted code is confined to prefetch blocks (and, better still, 
virtual pages) which don't contain code which is actually executed, 
the associated performance penalty is very small to non-existent.
> 
> > Or perhaps the compiler and OS already do a good job at keeping only
> > needed program segments in cache, in which case the problem lies yet
> > elsewhere.
> 
> Usually.

This actually has a lot to do with the hardware. It's not the size of 
the cache so much as its organization. If the L1 data cache uses e.g. 
32 byte lines, like the Pentium, and is only 4-way associative, you 
have direct access to _at most_ only 128 bytes of data - however big 
the cache itself is. Worse, if you access data in such a way that 
there are multiples of 128 bytes between accessed elements, you will 
in effect be using only one cache line, and you will get at least 
four times as many cache misses as you'd expect! The L2 cache helps 
you out of this particular hole, but there's still a performance 
penalty. This is an example of the sort of situation that compiler 
writers find quite hard to deal with.

Also there's the point that we're dealing with virtual machines. The 
best example I can think of to illustrate this is the old VAX problem 
of initializing a square array:

                DIMENSION A(1000,1000)
                DO 10 J=1,1000
                DO 10 I=1,1000
10              A(I,J)=0.0

runs very many times faster (well over 100 times!) if you simply swap 
round the DO statements (assuming that you have a maximum working set 
size of  512 pages = 256KB, which was typical on early VAX 
installations).

The reason is that one loop order writes the elements in virtual-memory 
order, with just one demand-zero page fault on first access and just 
one write-dirty page fault on completion; running by "columns" 
instead of "rows" generates 127 extra re-read and write-dirty page 
faults for each of the 7813 pages referenced. And, with a small 
working set, many of the page faults would result in actual I/O to 
the swap/page file, as well as the overhead of actually resolving 
virtual addresses to physical ones.

If you're writing in C, you'd probably use pointers instead of array 
subscripts to do this particular job; but the required order of the 
loops for efficient operation would be _reversed_ compared with 
Fortran.

You really do need to use profiling tools to evaluate code; you also 
need detailed knowledge of the architecture you're working with in 
order to get things really well optimized.

Compiler design & implementation can and does make a significant 
difference, but we're a long way from being able to tell a compiler, 
e.g., "Make me a program to LL test Mersenne numbers; I care about 
execution speed at the expense of everything else".

> The only secure way to get this right is to use good profiling feedback 
> which can record information of how often each variable and constant is 
> needed, find patterns in how the program accesses what memory, where 
> branches tend to go, etc, and then use this information to optimize 
> register allocations, prefetching, etc in a second pass through the 
> compiler.

Actually you need to repeat this step until every change you try 
makes things worse rather than better.


Regards
Brian Beesley
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Wed, 10 Nov 1999 08:28:31 +1300 (NZDT)
From: Bill Rea <[EMAIL PROTECTED]>
Subject: Mersenne: Re: Mlucas on SPARC

> From [EMAIL PROTECTED] Tue Nov  9 12:10:09 1999
> 
> >On my Ultra-5 with a small 256Kb L2 cache I get 0.58 secs/iter
> >for MLU against 0.78 secs/iter for Mlucas at 512K FFT.
> 
> These timings suggest that it's more than a cache size issue - after
> all, Mlucas has a smaller memory footprint irrespective of the CPU,
> and one would expect some benefit from this even in cases where
> both codes significantly exceed the L2 cache size. I wonder whether
> the code itself (bloated by all the routines needed for
> non-power-of-2 FFT lengths, which MLU doesn't support) might be
> causing a slowdown here, by competing for space in the L2 cache
> with FFT data? I've little experience with this aspect of performance,
> but perhaps conditional compilation, with each binary incorporating
> only the routines it needs for that length, could reduce the code
> footprint and help performance - would any of the computer science
> experts care to comment?

These low end machines do seem to have a bottleneck feeding the
processor from memory. An Ultra-5 is not an old machine. When I first
started trying different compiler options I found the timed tests
supplied with MLU would give very significant speed differences, but these
almost never translated into a speed difference on a real exponent. I
had a chance to have a very brief talk with a Sun Engineer about this and
he said these low end machines are built to be cost-competitive with
PCs and the CPU probably spends a lot of its time "spinning its wheels"
while waiting for data to be fed to it. Once I got the iteration time
down to 0.6 secs/iter, which was achieved with minimal compiler options,
I didn't get any speed improvement at the 512K FFT level until I used 
the profiling. With that I got it down to the present 0.58 secs/iter.


> Or perhaps the compiler and OS already do a good job at keeping only
> needed program segments in cache, in which case the problem lies yet
> elsewhere.

The executable sizes are quite different:-

113216 bytes for MLU
340048 bytes for Mlucas

I don't know enough about how the operating system handles code
in the cache to know whether this is significant. I would guess
on a little 256Kb cache it could make a difference to the speed
of execution. For a large 4Mb cache this probably isn't much to 
worry about. 


[snip] 
> We also could use help from any SPARC employees familiar with the
> compiler technology to tell us why the f90 -xprefetch option is
> so unpredictable - it speeds some runlengths by up to 10%, but more
> often causes a 10-20% slowdown (or no change).

This would be very helpful. I'm sure Ernst would be happy to let
the compiler writers use his code to improve the compiler. (Any
Sun engineers on this list?) 

> Well, I didn't really expect it to make much difference, which is why
> I expressed surprise when you said that it did. Was the slowdown you
> mentioned for MLU in 64-bit mode also spurious?

It wasn't when I reported it, but it is now spurious. With systems
with small caches I now think that the compiler option -xspace is
very important. This tells the compiler to do no optimizations which
would increase code size. Compiling MLU as 64-bit initially
resulted in a much bigger executable. With the restart capability
it's fairly easy to compile a new binary and within a couple of
hours know whether you've improved the speed, but the save files
of 32 and 64-bit MLU are not compatible so you have to opt for
one or the other on a particular exponent and stay with it to
the end. After several recompiles and restarts there is now no
noticeable difference in speed between 32 and 64-bit MLU. But I've
also managed to reduce the executable size of the 64-bit version
to where it's virtually the same as the 32-bit.

Bill Rea, Information Technology Services, University of Canterbury  \_ 
E-Mail b dot rea at its dot canterbury dot ac dot nz                 </   New 
Phone   64-3-364-2331, Fax     64-3-364-2332                        /)  Zealand 
Unix Systems Administrator                                         (/' 


 
 
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 9 Nov 1999 18:41:16 EST
From: [EMAIL PROTECTED]
Subject: Mersenne: executable size vs. L2 cache (Was: Mlucas on SPARC)

Bill Rea wrote:

> >On my Ultra-5 with a small 256Kb L2 cache I get 0.58 secs/iter
> >for MLU against 0.78 secs/iter for Mlucas at 512K FFT.

I wrote:

> These timings suggest that it's more than a cache size issue - after
> all, Mlucas has a smaller memory footprint irrespective of the CPU,
> and one would expect some benefit from this even in cases where
> both codes significantly exceed the L2 cache size. I wonder whether
> the code itself (bloated by all the routines needed for
> non-power-of-2 FFT lengths, which MLU doesn't support) might be
> causing a slowdown here, by competing for space in the L2 cache
> with FFT data?

Bill replied:

>The executable sizes are quite different:-
>
>113216 bytes for MLU
>340048 bytes for Mlucas

That is *exactly* the kind of size difference one might expect to cause
an appreciable performance difference on a 256KB cache machine. I'll
bet the runtime profiling is designed to strip out unused code sections
from the executable image, and thus reduce its size. If you still can't
get -xprofile to compile decently fast on Mlucas (a problem you noted
earlier), there's a manual way to test this hypothesis, namely:

1) Pick an FFT length for testing. Look at the combination of FFT
radices used for that N by Mlucas in mers_mod_square. (Example: for
N = 224K = 224*1024 = 229376, mers_mod_square lists a set of complex
radices (7,8,8,16,16), whose product is 229376/2.)
2) Comment out all subroutine calls in mers_mod_square which are not
to the routines for the radices in (1). E.g. for the 224K example,
in the select case(radix(1)) blocks, comment out all calls except
the ones to radix7_dif_pass1, radix7_ditN_cy_dif1 and radix7_dit_pass1.

3) Recompile and compare the size of the executable to that of the
full executable. If it's not substantially smaller, you may have to
physically remove the commented-out subroutines from the program file,
then recompile.

4) Once your .exe is reasonably small, run some timings. Note that the
code compiled for 224K above would also work for any N which uses a
combination of radices of the form (7,{any combination of 4,8 or 16},16)
(the final radix must always be a 16), i.e. also for 112K and 448K
(assuming you're using radices (7,8,16,16,16), not (14,4,16,16,16)
for the latter.)

If this kind of thing does prove helpful (especially on small-cache
and/or low-bandwidth systems), once we have a Unix PrimeNet interface
to automate execution, it will be relatively easy to replace the
current single Mlucas executable with a set of smaller ones,
each handling a set of FFT lengths that share the same initial
radix, e.g. radix(1) = 3,5,6,7,8,10,14,16, the ones currently
supported by Mlucas.

The other thing these considerations imply is that compiling in
64-bit mode (due to the large .exe that results) may actually be
counterproductive in many instances, unless the code makes heavy
use of some 64-bit opcodes which are not supported in 32-bit mode.

Let me know what you find,
- -Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 9 Nov 1999 18:41:14 EST
From: [EMAIL PROTECTED]
Subject: Mersenne: New Mlucas binary for Alpha/Linux

Thanks to Paul Novarese, I have a new Mlucas binary for Alpha/Linux:

ftp://209.133.33.182/pub/mayer/bin/ALPHA_LINUX/Mlucas.tgz

Paul pointed out to me that the old binary was not statically compiled,
i.e. it needed separate RTL files, contrary to the documentation
in my README file. The code is the same as before, just compiled
using the -non_shared flag.

If this flag works similarly on other compilers, future releases for
Alpha Unix and SPARC Solaris will also no longer need separate RTL files
(although if you downloaded and installed those already, they would not
need to be updated each time the code changes anyway).

Cheers,
- -Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 09 Nov 1999 22:02:51 -0500
From: George Woltman <[EMAIL PROTECTED]>
Subject: Mersenne: Linux error 2250 solved

Hi all,

Mprime version 19.1 is now available.  The only new feature of any consequence
is a fix for the error 2250 that some users of the statically linked mprime
suffered.  The whatsnew.txt file describes the line you need to put in
primenet.ini.  You must create the primenet.ini file yourself.

Many thanks to all who helped narrow the problem down to a difference
in the way the gethostbyname call worked in different versions of the OS.

Regards,
George

P.S.  P-1 factorers - do not use the Stage1GCD undocumented feature.  It may
have a bug.


_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 9 Nov 1999 23:04:45 -0500 (EST)
From: "Vincent J. Mooney Jr." <[EMAIL PROTECTED]>
Subject: Mersenne: Results file

[Sun Oct 10 16:37:44 1999]
Self-test 448 passed!
[Mon Oct 18 07:56:26 1999]
Iteration: 2390772/9008231, ERROR: SUM(INPUTS) != SUM(OUTPUTS),
6.646378947096065e+016 != 6.638737981499106e+016
Possible hardware failure, consult the readme file.
Continuing from last save file.
[Mon Nov 08 17:42:12 1999]
M9008231 is not prime. Res64: 9304F573B0401CDE. WU1: 0F8DEADF,1083659,00000000

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Wed, 10 Nov 1999 12:57:18 -0500
From: George Woltman <[EMAIL PROTECTED]>
Subject: Mersenne: Re: Setup Testers needed

Hi all,

        Geoffrey's email below needs some clarification.  First, some
history.  We know that most programs you download come with a nice setup
program rather than a plain old zip file.  Geoffrey contacted Wise Solutions
(http://www.wisesolutions.com) and begged on GIMPS' behalf for a free
copy of their Wise for Windows Installer program.  They graciously gave
us a copy.
        Geoffrey was then able to quickly build us a fancy setup program.
He reports that Wise's program is excellent.

        Since the next release of prime95 will use Geoffrey's work, it would
be nice if a handful of users could give us both some feedback.  Be sure
to back up your directory before trying it out.  Questions include:
1)  Does it find your current prime95 folder?  2)  Does it install the
new version successfully?  3)  Can you uninstall?  4)  Are there features
missing?  5)  Could some features be better implemented or worded?

Thanks,
George

At 01:05 PM 11/9/99 -0500, Geoffrey Faivre-Malloy wrote:
>Prime95Setup is available.  If you'd like
>to take a few minutes and help test it and make comments/suggestions, I'd
>appreciate it.  You can download it (for now) at:
>
>http://www.mindspring.com/~gjf/Prime95Setup.EXE
>
>It's about 650kb in size. 

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Thu, 11 Nov 1999 14:56:37 +1300 (NZDT)
From: Bill Rea <[EMAIL PROTECTED]>
Subject: Mersenne: Re: executable size vs. L2 cache (Was: Mlucas on SPARC)

> From [EMAIL PROTECTED] Wed Nov 10 12:42:00 1999
> 
> Bill replied:
> 
> >The executable sizes are quite different:-
> >
> >113216 bytes for MLU
> >340048 bytes for Mlucas
> 
> That is *exactly* the kind of size difference one might expect to cause
> an appreciable performance difference on a 256KB cache machine. I'll
> bet the runtime profiling is designed to strip out unused code sections
> from the executable image, and thus reduce its size. If you still can't
> get -xprofile to compile decently fast on Mlucas (a problem you noted
> earlier), there's a manual way to test this hypothesis, namely:
> 
> 1) Pick an FFT length for testing. Look at the combination of FFT
> radices used for that N by Mlucas in mers_mod_square. (Example: for
> N = 224K = 224*1024 = 229376, mers_mod_square lists a set of complex
> radices (7,8,8,16,16), whose product is 229376/2.)
> 
> 2) Comment out all subroutine calls in mers_mod_square which are not
> to the routines for the radices in (1). E.g. for the 224K example,
> in the select case(radix(1)) blocks, comment out all calls except
> the ones to radix7_dif_pass1, radix7_ditN_cy_dif1 and radix7_dit_pass1.
> 
> 3) Recompile and compare the size of the executable to that of the
> full executable. If it's not substantially smaller, you may have to
> physically remove the commented-out subroutines from the program file,
> then recompile.
> 
> 4) Once your .exe is reasonably small, run some timings. Note that the
> code compiled for 224K above would also work for any N which uses a
> combination of radices of the form (7,{any combination of 4,8 or 16},16)
> (the final radix must always be a 16), i.e. also for 112K and 448K
> (assuming you're using radices (7,8,16,16,16), not (14,4,16,16,16)
> for the latter.)
> 
> If this kind of thing does prove helpful (especially on small-cache
> and/or low-bandwidth systems), once we have a Unix PrimeNet interface
> to automate execution, it will be relatively easy to replace the
> current single Mlucas executable with a set of smaller ones,
> each handling a set of FFT lengths that share the same initial
> radix, e.g. radix(1) = 3,5,6,7,8,10,14,16, the ones currently
> supported by Mlucas.
> 
> The other thing these considerations imply is that compiling in
> 64-bit mode may actually be counterproductive in many instances
> (because of the larger .exe that results), unless the code makes heavy
> use of some 64-bit opcodes which are not supported in 32-bit mode.
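
[Aside: the radix bookkeeping in Ernst's step 1 is easy to script. Here is a
minimal sketch, assuming only the rules stated above -- the complex radices
must multiply to N/2, and the final radix must be a 16. The 112K tuple is my
own inference from his "(7,{any combination of 4,8 or 16},16)" rule.]

```python
# Radix check per the instructions above: for an FFT of length N, the
# complex radices must multiply to N/2, and the final radix must be 16.
def radices_ok(n, radices):
    product = 1
    for r in radices:
        product *= r
    return product == n // 2 and radices[-1] == 16

# 224K example from the instructions: 224*1024 = 229376.
assert radices_ok(224 * 1024, (7, 8, 8, 16, 16))
# Compatible 112K length (radix tuple inferred, not from the original):
assert radices_ok(112 * 1024, (7, 8, 4, 16, 16))
# Both 448K radix sets Ernst mentions:
assert radices_ok(448 * 1024, (7, 8, 16, 16, 16))
assert radices_ok(448 * 1024, (14, 4, 16, 16, 16))
```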

Ernst,

The picture gets muddier. I built 6 different executables and tried
them for speed on my own Ultra-5: 270MHz CPU, 256KB L2 cache, and
128MB RAM. On five of them I used profiling: the first was run with the
512K and 640K FFTs before recompiling; all the others were run
with the 224K FFT and recompiled. The compiler options used were:-

options1 = -fast -xO5 -xsafe=mem -xprefetch -xtarget=native 
- -xarch=v8plusa -xchip=ultra2i -xprofile=collect:Mlucas

options2 = -fast -xtarget=native -xarch=v8plusa -xprofile=collect:Mlucas

On recompiling, the -xprofile=collect is changed to -xprofile=use; the
feedback files are removed between compiles with different options.

options3 = -fast -xtarget=native -xarch=v8plusa

1) General purpose Mlucas compiled in 64-bit mode with options1, v9a used in
   place of v8plusa.
2) Code commented out as per above instructions, 32-bit mode, with options1.
3) General purpose Mlucas, 32-bit mode, with options1 plus -xspace.
4) Unused code deleted from source file, 32-bit mode, with options1.
5) General purpose Mlucas, 32-bit mode, with options2.
6) General purpose Mlucas, options3.

Executable no.    Size (bytes)   Secs/iter on 224K FFT
1                 416792         0.291
2                 349704         0.265
3                 314416         0.304
4                 215480         0.281
5                 348808         0.276
6                 340064         0.327

The compiler manual says about the -fast option "This option provides
close to the maximum performance for many realistic applications".
It looks like they're right about that, provided you use profiling
and recompile, with only (2) running faster.
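
[For concreteness, here is how the table's numbers work out -- a quick
back-of-envelope script, with the times copied from the table above and
"faster" read both against the plain -fast build (6) and against MLU's
0.276 secs/iter:]

```python
# Per-iteration times (secs) on the 224K FFT, from the table above;
# executable 6 is the plain -fast build with no profile feedback.
times = {1: 0.291, 2: 0.265, 3: 0.304, 4: 0.281, 5: 0.276, 6: 0.327}

# Every profiled build beats the plain -fast build...
faster_than_fast = sorted(n for n in times if times[n] < times[6])
# ...but only executable (2) comes in under MLU's 0.276 secs/iter.
faster_than_mlu = sorted(n for n in times if times[n] < 0.276)

print(faster_than_fast)  # [1, 2, 3, 4, 5]
print(faster_than_mlu)   # [2]
```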

With -fast, the -xtarget and -xchip options are already set, so
specifying them separately is redundant.

For comparison, MLU does 0.276 secs/iter on a 256K FFT.

I also ran my executable (1) on a 512K FFT and it did 0.70 secs/iter.
That's a big gain over the generic executable's 0.78 secs/iter I reported
earlier, but still a long way from MLU's 0.58 secs/iter.

To get the profiling to work I added a lot more swap space. As you may
recall, the compiler was running out of address space and dying.
It takes close to 2 hours to build an executable with profiling,
run it, then recompile using the results of the profiling. The
compiler grows to over 240MB of address space while optimizing,
and the disk rattles continuously as the system pages.

Bill Rea, Information Technology Services, University of Canterbury  \_ 
E-Mail b dot rea at its dot canterbury dot ac dot nz                 </   New 
Phone   64-3-364-2331, Fax     64-3-364-2332                        /)  Zealand 
Unix Systems Administrator                                         (/' 
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

End of Mersenne Digest V1 #658
******************************
