Mersenne Digest Wednesday, November 10 1999 Volume 01 : Number 658 ---------------------------------------------------------------------- Date: Mon, 8 Nov 1999 16:43:24 +0100 From: "Steinar H. Gunderson" <[EMAIL PROTECTED]> Subject: Mersenne: Re: Version 19 for FreeBSD? On Mon, Nov 08, 1999 at 01:28:05AM -0500, Bryan Fullerton wrote: >Well, yes, that's possible - or we can just run the v18 client for FreeBSD. But v19 is (a bit) faster, and has some extra features as well, no? >Given that there already is a v18 for FreeBSD, I'd assume that there's >someone who's helping George with that. Is that a correct assumption? Probably. But porters don't always have time to keep up with version upgrades. Use the source, Luke (eesh). /* Steinar */ - -- Homepage: http://members.xoom.com/sneeze/ _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 8 Nov 1999 18:09:52 EST From: [EMAIL PROTECTED] Subject: Mersenne: Re: Mlucas on SPARC Bill Rea <[EMAIL PROTECTED]> wrote (re. Mlucas and MacLucasUNIX timings on a 400MHz SPARC E450 with 4MB L2 cache): >>At 256K FFT I was seeing 0.11 secs/iter with >>Mlucas against 0.15 secs/iter for MLU, at 512K FFT the figures were >>0.25 secs/iter and 0.29 secs/iter respectively. ...and later added: >Observing the running process with top, it looks to me as if >the 256K Mlucas fits completely in the L2 cache, whereas MLU >takes about 12Mb. At 512K FFT around 60% of the running process >will fit in the L2 cache. It's a cache size issue here. A good rule of thumb is that Mlucas needs about (FFT length) x 8 bytes plus another 10% or so of memory for the data, plus some more for the program itself. Thus a 256K FFT should need a bit more than 2MB for data, and perhaps a few hundred KB more for code, i.e. the whole thing should fit into a 4MB L2 cache with room to spare.
At 512K FFT length, data+code need probably 5-6MB total. >On my Ultra-5 with a small 256Kb L2 cache I get 0.58 secs/iter >for MLU against 0.78 secs/iter for Mlucas at 512K FFT. These timings suggest that it's more than a cache size issue - after all, Mlucas has a smaller memory footprint irrespective of the CPU, and one would expect some benefit from this even in cases where both codes significantly exceed the L2 cache size. I wonder whether the code itself (due to all the routines needed for non-power-of-2 FFT lengths, which MLU doesn't support) might be causing a slowdown here, by competing for space in the L2 cache with FFT data? I've little experience with this aspect of performance, but perhaps conditional compilation, with each binary incorporating only the routines it needs for that length, could reduce the code footprint and help performance - would any of the computer science experts care to comment? Or perhaps the compiler and OS already do a good job at keeping only needed program segments in cache, in which case the problem lies yet elsewhere. One other possibility is that the SPARC f90 compiler, being *much* younger than the (apparently very good) C compiler, has better support for the v8 and v9 instruction sets, i.e. they didn't put as much work into including optimizations for older CPUs - I don't know. We could also use help from any SPARC employees familiar with the compiler technology to tell us why the f90 -xprefetch option is so unpredictable - it speeds some runlengths by up to 10%, but more often causes a 10-20% slowdown (or no change). Bill also wrote: >>>Mlucas runs significantly faster if you can compile and run it >>>on a 64-bit Solaris 7 system. I replied: >>This is the first I've heard of this - roughly how much of a speedup >>do you see? Bill again: >I'm a bit red-faced on this one. I just tried it again and it doesn't. >This is still a mystery to me.
>It would seem to me that for >this type of code, having full access to the 64-bit instruction >set of the UltraSPARC CPUs and running it on a 64-bit operating system >would give you the best performance. But that doesn't seem to be the >case. Well, I didn't really expect it to make much difference, which is why I expressed surprise when you said that it did. Was the slowdown you mentioned for MLU in 64-bit mode also spurious? Floating-point dominated code like Mlucas shouldn't really benefit much from the 64-bit mode, except perhaps compared to that generated by the crappy f90 v1 compiler, which wouldn't do 64-bit load/stores even if one specified such in the compile flags. In any event, Mlucas appears to perform very well on the newer SPARC CPUs and decently well on the old ones (but there one may want to use MLU at power-of-2 runlengths if it proves faster), so perhaps we shouldn't get too greedy...I'm joking, of course - speed, give us more speed! - -Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 8 Nov 1999 18:09:53 EST From: [EMAIL PROTECTED] Subject: Mersenne: Mlucas on multiple CPUs One reminder about using Mlucas on multiple machines - we don't have an automated interface yet (and some users appear to actually enjoy sending in their latest results manually - with new exponents, it's not like they're having to check their output files daily), but I've tried to make it reasonably easy to manage assignments. Just paste 2 or 3 exponents (perhaps more, if doing double-checking on a fast machine) into the worktodo.ini file, then, when the code is down to only a week or so of work (i.e. there are only 1-2 p's left in the .ini file), just append more exponents to the end of the file - you can do this without stopping the program.
For those of you used to the Prime95 worktodo.ini format, remember that Mlucas needs EXPONENTS ONLY in this file, nothing else. (We'll only need the full format when we integrate the factoring modules.) Also, remember to strip off the leading space in the Mxxyyzz Res64: output line before sending any results to PrimeNet. I'll fix this problem in the next release; for now you'll need to do it manually. Happy hunting, - -Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Tue, 09 Nov 1999 04:20:04 +0100 From: Sturle Sunde <[EMAIL PROTECTED]> Subject: Re: Mersenne: Re: Mlucas on SPARC > perhaps conditional compilation, with each binary incorporating > only the routines it needs for that length, could reduce the code > footprint and help performance It can help if it makes it possible for the compiler to eliminate jumps and branches that way. > Or perhaps the compiler and OS already do a good job at keeping only > needed program segments in cache, in which case the problem lies yet > elsewhere. Usually. > One other possibility is that the SPARC f90 compiler, being *much* > younger than the (apparently very good) C compiler, has better > support for the v8 and v9 instruction sets, i.e. they didn't put > as much work into including optimizations for older CPUs - I don't > know. A compiler is normally built of several separate parts: The frontend: o Lexical analyzer, syntax analyzer and semantic analyzer - Usually tightly coupled and always language specific. This is the frontend, or the parser. o Intermediate code generator - Takes a syntax tree from the parser and makes intermediate code in a simple format -- simple to produce and simple to make machine code from.
Some optimizations are practical to do here (some things are easier to detect from the syntax tree than from the intermediate code) but most of it is left for the next step. The backend: o Code optimizer - Optimizes the intermediate code in several passes. o Code generator - Outputs the actual code for the specific platform. As you can see, it is by design very easy to use the same backend for different languages and vice versa, so I doubt that Sun is developing separate backends for different languages. They only need to adjust the frontend to create working intermediate code from a Fortran program, and leave the rest to their already working backend. To support another CPU they only need to change the backend to optimize and create code for that CPU. A lot of the optimizing is also CPU independent. For a quick but closer look at how a compiler works and how optimizations are done, see chapter 14 of "Using and Porting GNU CC" by Richard Stallman. > We also could use help from any SPARC employees familiar with the > compiler technology to tell us why the f90 -xprefetch option is > so unpredictable - it speeds some runlengths by up to 10%, but more > often causes a 10-20% slowdown (or no change). I'm not a SPARC employee, but I have some credits in graduate level compiler technology. 8-) Prediction is often very hard to do in a compiler. Perfect register allocation is practically impossible. You can try and in most cases get a very good result, but not perfect. If you try too hard and guess wrong in a few places, you may as well end up doing the opposite of optimization. For the compiler it might look like you need a chunk from the memory in only a short time, and it is "smart" and prefetches everything well ahead. It might even allocate registers for the data, while the CPU gets time to predict the next branch. Very smart when it works. In reality, however, contrary to what the compiler predicted, you might loop around 1000 more times before you take the other branch.
Because the compiler guessed wrong, you have lost some registers, and data you needed has been evicted from the cache in favour of data you won't need until you've looped 1000 more times. The only secure way to get this right is to use good profiling feedback which can record information about how often each variable and constant is needed, find patterns in how the program accesses what memory, where branches tend to go, etc, and then use this information to optimize register allocations, prefetching, etc in a second pass through the compiler. To people especially interested in performance on Sun, I can recommend chapter 2 of the whitepaper "Delivering Performance on Sun: Optimizing Applications for Solaris", which is available here: PostScript: http://www.sun.com/workshop/whitepapers/sol-app-perf.ps Acrobat: http://www.sun.com/workshop/whitepapers/sol-app-perf.pdf You should also install the latest patches from Sun. There are a lot for f90: <URL: http://access1.sun.com/workshop/current-patches.html> - -- Sturle URL: http://www.stud.ifi.uio.no/~sturles/ ~~~~~~ This will end your Windows session. Do you want to play another game? _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Tue, 9 Nov 1999 00:08:30 -0500 (EST) From: jim barchuk <[EMAIL PROTECTED]> Subject: Mersenne: 19.0.2 and ERROR 2250: Server unavailable Hello All! Yes, I see the comments about this in the archives, but there's no fix suggested. Been running mprime like the Energiser Bunny: 16.3, 17.1, 18, and now 19.0.2. Using RedHat 4.2 with lots of upgrades, kernel 2.0.33. I haven't been reading this list lately. I did receive email on Oct 10 that V19 was avail, and it is 'faster,' so naturally I picked it up. It dropped in and appeared to work fine. BTW I use the static version because dynamic programs rarely like RH 4.2.
I look at mprime.log very infrequently. I do look at top usually a few times a day, and mprime is always at the top of the list. Today I glance and it's nowhere to be seen. ps -a says it's running, but using essentially no resources. So I look at the log and find 287 entries of ERROR 2250: Server unavailable dating back to the -day- I started V19 running. It's been -running- since then, but there's been -no- obvious notice of this problem. Yes, I tried the dynamic version, it fails with error in loading shared libraries : undefined symbol: __libc_start_main. Any clues? BTW I -strongly- suggest that when something like this happens, another 'whoopsie, potential serious error' message be broadcast to all participants, not just a mention of it in the list. As it stands my best (in fact only AFAIK) option is to fall back to V18 and lose a month's worth of CPU time. Thanks much. Have a :) day! jb - -- jim barchuk [EMAIL PROTECTED] _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Tue, 9 Nov 1999 14:42:52 +0000 From: "Steinar H. Gunderson" <[EMAIL PROTECTED]> Subject: Re: Mersenne: 19.0.2 and ERROR 2250: Server unavailable On Tue, Nov 09, 1999 at 12:08:30AM -0500, jim barchuk wrote: >I haven't been reading this list lately. I did receive email on Oct 10 >that V19 was avail, and it is 'faster,' so natually I picked it up. It >dropped in and appeared to work fine. BTW I use the static version because >dynamic programs rarely like RH 4.2. The problem seems to be that mprime is compiled against glibc 2.1, while RH 4.2 uses... libc5? glibc 2.0? glibc 2.1 (even in the static version) needs some special files to operate. This has been discussed on the list before, and perhaps it's time to fix it :-) I could perhaps download glibc 2.1 and compile it with static NSS.
Then I could compile mprime against it (perhaps I'll need George's security.c and security.h, but that's another matter entirely -- we'll take that when it comes) to produce a version that would work for non-2.1 users. Any protests? >Any clues? For now, you could revert to v18, use v18 only to report your results, report them via the manual forms, or just wait for us to do something about it :-) /* Steinar */ _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Tue, 9 Nov 1999 13:05:43 -0500 From: "Geoffrey Faivre-Malloy" <[EMAIL PROTECTED]> Subject: Mersenne: Testers needed After much anticipation and delay (well, maybe not anticipated because I didn't tell many people about it), Prime95Setup is available. If you'd like to take a few minutes and help test it and make comments/suggestions, I'd appreciate it. You can download it (for now) at: http://www.mindspring.com/~gjf/Prime95Setup.EXE It's about 650kb in size. Yes, I know it's larger, but I hope this will help us get more coverage as it will make it easier to install. Standard disclaimer applies - ya know the one that says I'm not responsible if it formats your hard drive, sends all your private data to Microsoft, etc. :) I've tested it out and I don't think it's going to do that but ya never know :) Anyway, let me know if you find any bugs. Flames should be sent to /dev/null G-Man _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Tue, 9 Nov 1999 18:43:14 -0000 From: "Brian J. Beesley" <[EMAIL PROTECTED]> Subject: Re: Mersenne: 19.0.2 and ERROR 2250: Server unavailable On 9 Nov 99, at 14:42, Steinar H.
Gunderson wrote: > The problem seems to be that mprime is compiled to glibc 2.1, while RH 4.2 > uses... libc5? glibc 2.0? glibc 2.1 (even in static version) needs some > special files to operate. This has been discussed on the list before, and > perhaps it's time to fix it :-) I tend to agree... Actually someone (George?) put up mprime & sprime v19.1 over the weekend, which (according to the whatsnew file) has a workaround coded in - though I'm unable to check whether it works, since I've not had problems with v19.0 on any of my systems. > I could perhaps download and compile glibc 2.1, and compile it with static NSS. > Then I could compile mprime (perhaps I'll need George's security.c and > security.h, but that's another matter entirely -- we'll take that when it > comes) with a version that would work for non-2.1 users. Any protests? Would it be reasonable to suggest instead that someone build sprime v19.x on a system running RH 4.x, or something else old enough that the dynamic library problem hadn't been invented? Or does that replace one problem with another, viz. a binary that won't run on current releases? I don't think there is any other significant issue ... ??? > > For now, you could revert to v18, use v18 only to report your results, report > them via the manual forms, or just wait for us to do something about it :-) Reverting to v18 - you'll lose any work in progress & have to restart it from scratch 8-( Using v18 to report results - there would seem to be scope for confusion here - messy! 8-( Reporting via manual forms - sounds OK for a few systems that are relatively easily "got at", but unmanageable for large numbers of systems - especially if they don't have a web browser available. Also will cause loss of PrimeNet credit 8-( Just wait - I think we could & should get something done about this PDQ. 
One possible option would be to archive the v19 save file(s), stop v19, remove the current assignment from worktodo.ini (by deleting the top line in the file) and restart using v18 for the time being. When a fixed v19 becomes available, replace the save file(s) & the worktodo.ini entry for the assignment. This will preserve work in progress, though the credit will obviously be deferred. Also the option of upgrading to a new version of linux should be considered. The cost is not great (get the CheapBytes CD for less than $5), though there may be an issue with disk space on some systems - linux bloats too, though by no means as badly as windoze! The point here is that there are a number of security problems (especially in sendmail and the ftp daemon) which need attention; upgrading to a recent release is a reasonably convenient way of fixing these. Though note that RH 6.0 _still_ needs wu-ftpd upgraded; everything prior to v2.6.0 is vulnerable, and hackers are actively scanning IP address space looking for systems running wu-ftpd v2.5.x. Regards Brian Beesley _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Tue, 9 Nov 1999 18:43:14 -0000 From: "Brian J. Beesley" <[EMAIL PROTECTED]> Subject: Re: Mersenne: Re: Mlucas on SPARC On 9 Nov 99, at 4:20, Sturle Sunde wrote: > > perhaps conditional compilation, with each binary incorporating > > only the routines it needs for that length, could reduce the code > > footprint and help performance > > It can help if it makes it possible for the compiler to eliminate jumps > and branches that way. Or optimizes jumps & branches. There's a significant performance penalty associated with jumping to an instruction which is badly aligned with the way in which blocks of code are prefetched for decoding. 
If unexecuted code is confined to prefetch blocks (and, better still, virtual pages) which don't contain code which is actually executed, the associated performance penalty is very small to nonexistent. > > > Or perhaps the compiler and OS already do a good job at keeping only > > needed program segments in cache, in which case the problem lies yet > > elsewhere. > > Usually. This actually has a lot to do with the hardware. It's not the size of the cache so much as its organization. If the L1 data cache uses e.g. 32-byte lines, like the Pentium, and is only 4-way associative, you have direct access to _at most_ only 128 bytes of data - however big the cache itself is. Worse, if you access data in such a way that there are multiples of 128 bytes between accessed elements, you will in effect be using only one cache line, and you will get at least four times as many cache misses as you'd expect! The L2 cache helps you out of this particular hole, but there's still a performance penalty. This is an example of the sort of situation that compiler writers find quite hard to deal with. Also there's the point that we're dealing with virtual machines. The best example I can think of to illustrate this is the old VAX problem of initializing a square array:

      DIMENSION A(1000,1000)
      DO 10 J=1,1000
      DO 10 I=1,1000
   10 A(I,J)=0.0

runs very many times faster (well over 100 times!) than the same loops with the DO statements swapped round (assuming that you have a maximum working set size of 512 pages = 256KB, which was typical on early VAX installations). The reason is that this order writes the elements in virtual memory order, with just one demand-zero page fault on first access and just one write dirty page fault on completion; running by "columns" instead of "rows" generates 127 extra re-read and write dirty page faults for each of the 7813 pages referenced.
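The same trap exists in C, where arrays are row-major, so the cache- and page-friendly subscript order is the reverse of Fortran's. An illustrative sketch (not code from either LL program; the function names are made up):

```c
#include <stddef.h>

#define N 1000

float a[N][N];   /* ~4MB, like the VAX example's REAL array */

/* Fast: the innermost loop varies the LAST subscript, so writes run
   sequentially through memory (C is row-major, opposite of Fortran). */
void init_fast(void)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = 0.0f;
}

/* Slow: consecutive writes are N*sizeof(float) bytes apart, so each
   one lands on a different virtual page and a different cache line -
   the C analogue of the swapped-DO version of the VAX loop. */
void init_slow(void)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = 0.0f;
}
```

Both functions produce an identical result; only the traversal order, and hence the paging and cache behaviour, differs.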
And, with a small working set, many of the page faults would result in actual I/O to the swap/page file as well as the overhead of actually resolving virtual addresses to physical. If you're writing in C, you'd probably use pointers instead of array subscripts to do this particular job; but the required order of the loops for efficient operation would be _reversed_ compared with Fortran. You really do need to use profiling tools to evaluate code; you also need detailed knowledge of the architecture you're working with in order to get things really well optimized. Compiler design & implementation can and does make a significant difference, but we're a long way from being able to tell a compiler, e.g., "Make me a program to LL test Mersenne numbers; I care about execution speed at the expense of everything else". > The only secure way to get this right is to use good profiling feedback > which can record information about how often each variable and constant is > needed, find patterns in how the program accesses what memory, where > branches tend to go, etc, and then use this information to optimize > register allocations, prefetching, etc in a second pass through the > compiler. Actually you need to repeat this step until every change you try makes things worse rather than better. Regards Brian Beesley _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Wed, 10 Nov 1999 08:28:31 +1300 (NZDT) From: Bill Rea <[EMAIL PROTECTED]> Subject: Mersenne: Re: Mlucas on SPARC > From [EMAIL PROTECTED] Tue Nov 9 12:10:09 1999 > > >On my Ultra-5 with a small 256Kb L2 cache I get 0.58 secs/iter > >for MLU against 0.78 secs/iter for Mlucas at 512K FFT.
> > These timings suggest that it's more than a cache size issue - after > all, Mlucas has a smaller memory footprint irrespective of the CPU, > and one would expect some benefit from this even in cases where > both codes significantly exceed the L2 cache size. I wonder if > the fact that the code itself (due to all the routines needed for > non-power-of-2 FFT lengths, which MLU doesn't support) might be > causing a slowdown here, by competing for space in the L2 cache > with FFT data? I've little experience with this aspect of performance, > but perhaps conditional compilation, with each binary incorporating > only the routines it needs for that length, could reduce the code > footprint and help performance - would any of the computer science > experts care to comment? These low end machines do seem to have a bottleneck feeding the processor from memory. An Ultra-5 is not an old machine. When I first started trying different compiler options I found the timed tests supplied with MLU would give very significant speed differences, but these almost never translated into a speed difference on a real exponent. I had a chance to have a very brief talk with a Sun Engineer about this and he said these low end machines are built to be cost-competitive with PCs and the CPU probably spends a lot of its time "spinning its wheels" while waiting for data to be fed to it. Once I got the iteration time down to 0.6 secs/iter, which was achieved with minimal compiler options, I didn't get any speed improvement at the 512K FFT level until I used the profiling. With that I got it down to the present 0.58 secs/iter. > Or perhaps the compiler and OS already do a good job at keeping only > needed program segments in cache, in which case the problem lies yet > elsewhere. The executable sizes are quite different:- 113216 bytes for MLU 340048 bytes for Mlucas I don't know enough about how the operating system handles code in the cache to know whether this is significant. 
I would guess on a little 256Kb cache it could make a difference to the speed of execution. For a large 4Mb cache this probably isn't much to worry about. [snip] > We also could use help from any SPARC employees familiar with the > compiler technology to tell us why the f90 -xprefetch option is > so unpredictable - it speeds some runlengths by up to 10%, but more > often causes a 10-20% slowdown (or no change). This would be very helpful. I'm sure Ernst would be happy to let the compiler writers use his code to improve the compiler. (Any Sun engineers on this list?) > Well, I didn't really expect it to make much difference, which is why > I expressed surprise when you said that it did. Was the slowdown you > mentioned for MLU in 64-bit mode also spurious? It wasn't when I reported it, but it is now spurious. For systems with small caches, I now think that the compiler option -xspace is very important. This tells the compiler to do no optimizations which would increase code size. Compiling MLU as 64-bit initially resulted in a much bigger executable. With the restart capability it's fairly easy to compile a new binary and within a couple of hours know whether you've improved the speed, but the save files of 32 and 64-bit MLU are not compatible, so you have to opt for one or the other on a particular exponent and stay with it to the end. After several recompiles and restarts there is now no noticeable difference in speed between 32 and 64-bit MLU. But I've also managed to reduce the executable size of the 64-bit version to where it's virtually the same as the 32-bit.
Bill Rea, Information Technology Services, University of Canterbury \_ E-Mail b dot rea at its dot canterbury dot ac dot nz </ New Phone 64-3-364-2331, Fax 64-3-364-2332 /) Zealand Unix Systems Administrator (/' _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Tue, 9 Nov 1999 18:41:16 EST From: [EMAIL PROTECTED] Subject: Mersenne: executable size vs. L2 cache (Was: Mlucas on SPARC) Bill Rea wrote: > >On my Ultra-5 with a small 256Kb L2 cache I get 0.58 secs/iter > >for MLU against 0.78 secs/iter for Mlucas at 512K FFT. I wrote: > These timings suggest that it's more than a cache size issue - after > all, Mlucas has a smaller memory footprint irrespective of the CPU, > and one would expect some benefit from this even in cases where > both codes significantly exceed the L2 cache size. I wonder if > the fact that the code itself (due to all the routines needed for > non-power-of-2 FFT lengths, which MLU doesn't support) might be > causing a slowdown here, by competing for space in the L2 cache > with FFT data? Bill replied: >The executable sizes are quite different:- > >113216 bytes for MLU >340048 bytes for Mlucas That is *exactly* the kind of size difference one might expect to cause an appreciable performance difference on a 256KB cache machine. I'll bet the runtime profiling is designed to strip out unused code sections from the executable image, and thus reduce its size. If you still can't get -xprofile to compile decently fast on Mlucas (a problem you noted earlier), there's a manual way to test this hypothesis, namely: 1) Pick an FFT length for testing. Look at the combination of FFT radices used for that N by Mlucas in mers_mod_square. (Example: for N = 224K = 224*1024 = 229376, mers_mod_square lists a set of complex radices (7,8,8,16,16), whose product is 229376/2.)
2) Comment out all subroutine calls in mers_mod_square which are not to the routines for the radices in (1). E.g. for the 224K example, in the select case(radix(1)) blocks, comment out all calls except the ones to radix7_dif_pass1, radix7_ditN_cy_dif1 and radix7_dit_pass1. 3) Recompile and compare the size of the executable to that of the full executable. If it's not substantially smaller, you may have to physically remove the commented-out subroutines from the program file, then recompile. 4) Once your .exe is reasonably small, run some timings. Note that the code compiled for 224K above would also work for any N which uses a combination of radices of the form (7,{any combination of 4,8 or 16},16) (the final radix must always be a 16), i.e. also for 112K and 448K (assuming you're using radices (7,8,16,16,16), not (14,4,16,16,16) for the latter.) If this kind of thing does prove helpful (especially on small-cache and/or low-bandwidth systems), once we have a Unix PrimeNet interface to automate execution, it will be relatively easy to replace the current single Mlucas executable with a set of smaller ones, each handling a set of FFT lengths that share the same initial radix, e.g. radix(1) = 3,5,6,7,8,10,14,16, the ones currently supported by Mlucas. The other thing these considerations imply is that compiling in 64-bit mode (due to the large .exe that results) may actually be counterproductive in many instances, unless the code makes heavy use of some 64-bit opcodes which are not supported in 32-bit mode.
Let me know what you find, - -Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Tue, 9 Nov 1999 18:41:14 EST From: [EMAIL PROTECTED] Subject: Mersenne: New Mlucas binary for Alpha/Linux Thanks to Paul Novarese, I have a new Mlucas binary for Alpha/Linux: ftp://209.133.33.182/pub/mayer/bin/ALPHA_LINUX/Mlucas.tgz Paul pointed out to me that the old binary was not statically compiled, i.e. needed separate RTL files, in disagreement with the documentation in my README file. The code is the same as before, just compiled using the -non_shared flag. If this flag works similarly on other compilers, future releases for Alpha Unix and SPARC Solaris will also no longer need separate RTL files (although if you have already downloaded and installed those, they will not need to be updated each time the code changes anyway). Cheers, - -Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Tue, 09 Nov 1999 22:02:51 -0500 From: George Woltman <[EMAIL PROTECTED]> Subject: Mersenne: Linux error 2250 solved Hi all, Mprime version 19.1 is now available. The only new feature of any consequence is a solution to the error 2250 that some users of the statically linked mprime suffered. The whatsnew.txt file describes the line you need to put in primenet.ini. You must create the primenet.ini file. Many thanks to all who helped narrow the problem down to a difference in the way the gethostbyname call worked in different versions of the OS. Regards, George P.S. P-1 factorers - do not use the Stage1GCD undocumented feature. It may have a bug.
------------------------------

Date: Tue, 9 Nov 1999 23:04:45 -0500 (EST)
From: "Vincent J. Mooney Jr." <[EMAIL PROTECTED]>
Subject: Mersenne: Results file

[Sun Oct 10 16:37:44 1999] Self-test 448 passed!
[Mon Oct 18 07:56:26 1999] Iteration: 2390772/9008231, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 6.646378947096065e+016 != 6.638737981499106e+016
Possible hardware failure, consult the readme file. Continuing from last save file.
[Mon Nov 08 17:42:12 1999] M9008231 is not prime. Res64: 9304F573B0401CDE. WU1: 0F8DEADF,1083659,00000000

------------------------------

Date: Wed, 10 Nov 1999 12:57:18 -0500
From: George Woltman <[EMAIL PROTECTED]>
Subject: Mersenne: Re: Setup Testers needed

Hi all,

Geoffrey's email below needs some clarification. First, some history. We know that most programs you download come with a nice setup program rather than a plain old zip file. Geoffrey contacted Wise Solutions (http://www.wisesolutions.com) and begged on GIMPS' behalf for a free copy of their Wise for Windows Installer program. They graciously gave us a copy. Geoffrey was able to quickly make us a fancy setup program. He reports that Wise's program is excellent.

Since the next release of prime95 will use Geoffrey's work, it would be nice if a handful of users could give us both some feedback. Be sure to back up your directory before trying it out. Questions include:

1) Does it find your current prime95 folder?
2) Does it install the new version successfully?
3) Can you uninstall?
4) Are there features missing?
5) Could some features be better implemented or worded better?
Thanks,
George

At 01:05 PM 11/9/99 -0500, Geoffrey Faivre-Malloy wrote:
>Prime95Setup is available. If you'd like to take a few minutes and
>help test it and make comments/suggestions, I'd appreciate it. You
>can download it (for now) at:
>
>http://www.mindspring.com/~gjf/Prime95Setup.EXE
>
>It's about 650kb in size.

------------------------------

Date: Thu, 11 Nov 1999 14:56:37 +1300 (NZDT)
From: Bill Rea <[EMAIL PROTECTED]>
Subject: Mersenne: Re: executable size vs. L2 cache (Was: Mlucas on SPARC)

> From [EMAIL PROTECTED] Wed Nov 10 12:42:00 1999
>
> Bill replied:
>
> >The executable sizes are quite different:-
> >
> >113216 bytes for MLU
> >340048 bytes for Mlucas
>
> That is *exactly* the kind of size difference one might expect to cause
> an appreciable performance difference on a 256KB cache machine. I'll
> bet the runtime profiling is designed to strip out unused code sections
> from the executable image, and thus reduce its size. If you still can't
> get -xprofile to compile decently fast on Mlucas (a problem you noted
> earlier), there's a manual way to test this hypothesis, namely:
>
> 1) Pick an FFT length for testing. Look at the combination of FFT
> radices used for that N by Mlucas in mers_mod_square. (Example: for
> N = 224K = 224*1024 = 229376, mers_mod_square lists a set of complex
> radices (7,8,8,16,16), whose product is 229376/2.)
>
> 2) Comment out all subroutine calls in mers_mod_square which are not
> to the routines for the radices in (1). E.g. for the 224K example,
> in the select case(radix(1)) blocks, comment out all calls except
> the ones to radix7_dif_pass1, radix7_ditN_cy_dif1 and radix7_dit_pass1.
>
> 3) Recompile and compare the size of the executable to that of the
> full executable.
> If it's not substantially smaller, you may have to
> physically remove the commented-out subroutines from the program file,
> then recompile.
>
> 4) Once your .exe is reasonably small, run some timings. Note that the
> code compiled for 224K above would also work for any N which uses a
> combination of radices of the form (7,{any combination of 4,8 or 16},16)
> (the final radix must always be a 16), i.e. also for 112K and 448K
> (assuming you're using radices (7,8,16,16,16), not (14,4,16,16,16),
> for the latter).
>
> If this kind of thing does prove helpful (especially on small-cache
> and/or low-bandwidth systems), once we have a Unix PrimeNet interface
> to automate execution, it will be relatively easy to replace the
> current single Mlucas executable with a set of smaller ones, each
> handling a set of FFT lengths that share the same initial radix,
> e.g. radix(1) = 3,5,6,7,8,10,14,16, the ones currently supported
> by Mlucas.
>
> The other thing these considerations imply is that compiling in
> 64-bit mode may actually be counterproductive in many instances
> (due to the larger .exe that results), unless the code makes heavy
> use of some 64-bit opcodes which are not supported in 32-bit mode.

Ernst,

The picture gets muddier. I built 6 different executables and tried them for speed on my own Ultra-5: 270MHz CPU, 256Kb L2 cache, and 128Mb RAM. On five of them I used profiling; I ran the first one with 512K FFT and 640K FFT before recompiling, while all the others were run with the 224K FFT and recompiled. The compiler options used were:-

options1 = -fast -xO5 -xsafe=mem -xprefetch -xtarget=native
- -xarch=v8plusa -xchip=ultra2i -xprofile=collect:Mlucas

options2 = -fast -xtarget=native -xarch=v8plusa -xprofile=collect:Mlucas

On recompiling, the -xprofile "collect" is changed to "use"; the feedback files are removed between compiles with different options.

options3 = -fast -xtarget=native -xarch=v8plusa

1) General purpose Mlucas compiled in 64-bit mode with options1, with v9a used in place of v8plusa.
2) Code commented out as per above instructions, 32-bit mode, with options1.
3) General purpose Mlucas, 32-bit mode, with options1 plus -xspace.
4) Unused code deleted from source file, 32-bit mode, with options1.
5) General purpose Mlucas, 32-bit mode, with options2.
6) General purpose Mlucas, options3.

Executable no.   Size (bytes)   Secs/iter on 224K FFT
      1             416792            0.291
      2             349704            0.265
      3             314416            0.304
      4             215480            0.281
      5             348808            0.276
      6             340064            0.327

The compiler manual says about the -fast option: "This option provides close to the maximum performance for many realistic applications". It looks like they're right about that, provided you use profiling and recompile, with only (2) running faster. Since -fast already sets the -xtarget and -xchip options, specifying those separately is redundant.

For comparison, MLU does 0.276 secs/iter on a 256K FFT. I also ran my executable (1) on a 512K FFT and it did 0.70 secs/iter. That's a big gain on the generic executable's 0.78 secs/iter I reported earlier, but still a long way from MLU's 0.58 secs/iter.

To get the profiling to work I added a lot more swap space. As you may recall, the compiler was running out of address space and dying. It takes close to 2 hours to build an executable with profiling, run it, then recompile using the results of the profiling. The compiler grows to take over 240Mb of address space while optimizing, and the disk rattles continuously as the system pages.

Bill Rea, Information Technology Services, University of Canterbury  \_
E-Mail b dot rea at its dot canterbury dot ac dot nz                 </   New
Phone 64-3-364-2331, Fax 64-3-364-2332                               /)  Zealand
Unix Systems Administrator                                           (/'

------------------------------

End of Mersenne Digest V1 #658
******************************
