Mersenne Digest Tuesday, August 24 1999 Volume 01 : Number 618 ---------------------------------------------------------------------- Date: Sat, 21 Aug 1999 18:59:22 EDT From: [EMAIL PROTECTED] Subject: Mersenne: Re: DS20 timings Simon Burge writes (Hi, Simon!): >It looks like the DEC C compiler has the same problem - adding "-assume >accuracy_sensitive" to the compiler command line fixed the problem where >MacLucasUnix thought it could do 33219281 with a 1M FFT Yes, MacLucasUnix uses the same fast round strategy, so would suffer the same fate under aggressive optimization unless care is taken. >Have you got 2.6x and 2.6a mixed up? From what I understand, David used 2.6x for the double check, and 2.6a is currently on your ftp site. I hope not - otherwise, what was I working on for most of the past week? :) Actually, you're right - I created 2.6b and executables on another server (the one which has my F90 license), and in my haste to transfer the final files to my webserver and finish writing my posting I neglected to transfer the 2.6b source to the webserver (the executables were there). It's there now. The 2.6a source has been moved to the /pub/archived directory, for those of you who want to see what changes I made to the source to squeeze more speed out of the code. >These results are for a copy of Mlucas_2.6a.f90 I compiled with: > > f90 -o lm -tune ev6 -O5 lucas_mayer_V2.5b.f90 Many thanks for the timings! Looks like Mlucas is no more than 20% slower than MacLucasUnix at any runlength, and catches up with it at 1024K. Now get ready for a 15-20% further speed boost... The good news is, v2.6b should be about 10% faster (esp. at large FFT lengths) than your timings of 2.6a. Note that for 2.6b I've precompiled 3 separate executables for the three main Alpha generations (ev4 = 21064, ev5 = 21164, ev6 = 21264). 
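The "fast round strategy" mentioned above is, in its common form, the add-a-magic-constant trick; the sketch below is illustrative only (not Mlucas's actual source). It also shows why aggressive optimization is dangerous here: a compiler allowed to reassociate floating-point arithmetic may cancel the add/subtract pair outright, which is what options like -assume accuracy_sensitive prevent.

```python
# Illustrative sketch of the classic "fast round" trick (not Mlucas's
# actual code). Adding and then subtracting 1.5 * 2**52 leaves an IEEE-754
# double rounded to the nearest integer, because at that magnitude the
# spacing between representable doubles is exactly 1.
MAGIC = 1.5 * 2.0 ** 52  # 6755399441055744.0

def fast_round(x):
    # Valid for |x| well below 2**51. A compiler that reassociates
    # floating-point math may simplify (x + MAGIC) - MAGIC to just x,
    # silently breaking the rounding.
    return (x + MAGIC) - MAGIC
```

In Fortran or C the same trick only survives if the compiler is told to preserve the exact evaluation order.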
If you have time, try compiling the 2.6b source locally with various options and see if you can beat the ones I used for the ev6 executable (but please see my note about the timings below!) >The Mlucas_2.6a.exe from your FTP site gives the same results but is >slightly slower (around 5%) on the DS20: That 5% is likely due to the fact that the 2.6a .exe was compiled for a generic platform, i.e. sans ev4/5/6-specific tunings. > % time ./Mlucas_2.6a.exe < foo Note that on platforms which support the real*16 data type, you can't just do a single timing run and get an accurate timing. That's because if real*16 (actually, any extended-precision floating data type with at least 18 decimal digits of precision - I coded it that way to access the x86 80-bit register type, with compilers that can make use of it) is available, the program auto-detects it and uses it for high-precision FFT sincos data initializations. This may take up to the equivalent of 40-80 iteration times. To subtract this initialization component, you need to do 100- and 200-iteration timing runs, take the difference of the two timings and divide by 100. (The README file describes this.) I look forward to the revised (and hopefully improved) ev6 timings! Brian Beesley writes (Hi, Brian!): >> Hmm, those upper limits seem a bit low. Does MacLucasUnix tell you when >> the exponent is too large? > >The way I understand MacLucasUNIX _should_ work is that it guesses >that an FFT size of the next power of 2 above exponent/32 might work. >This is usually hopelessly optimistic. However, it monitors the round-off >errors for the first "few" (hundreds) of iterations; if the round-off >errors are too large then it restarts with double the FFT size. See Simon's note about the accuracy_sensitive option being needed for this to work properly. 
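The two-run timing recipe amounts to the following arithmetic (the run times here are invented for illustration; only the method is from the README description):

```python
# Subtracting the 100-iteration run from the 200-iteration run cancels the
# one-time sincos setup cost, which is identical in both runs.
t100 = 14.0  # hypothetical wall-clock seconds: setup + 100 iterations
t200 = 24.0  # hypothetical wall-clock seconds: setup + 200 iterations

per_iter = (t200 - t100) / 100.0  # true per-iteration time: 0.1 sec
setup = t100 - 100 * per_iter     # implied setup cost: ~4 sec
```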
>> size: 128K 256K 320K 384K 448K 512K 640K 768K 896K 1024K >> pmax: 2.62M 5.20M 6.46M 7.71M 8.96M 10.2M 12.6M 15.1M 17.5M 20M >> >> On strictly real*8-type hardware, the upper limits are about 1% less. >> The only cost of real*16 inits is a greater initialization time. > >Does this not make assumptions about the accuracy of the trig >functions in the library, also does it take into account the effect >of any "magic numbers" used in the DWT to achieve remaindering modulo >2^p-1 ? Sure it makes some assumptions, but not unreasonable ones. I know this because the SPEC98-submitted version of Mlucas has been tried on a huge variety of platforms and compiled with many different compilers, and the "1% rule" has proved a good one. Even with the most aggressive compile options and specifying less-accurate trig functions that are faster (a silly thing to do with Mlucas, since one only computes them once, so why not spend a few seconds more and get better accuracy?), one could still always get a pmax at least 98% of the ones listed above. The DWT weight factors (your "magic numbers") obey similar rules, i.e. there's nothing special about them (at least to the F90 compiler - perhaps the C compiler uses a different library? That would be dumb, but not shocking) that should cause them to be much more inaccurate than the trig functions, assuming the intrinsic function library was well-written (that may not be a good assumption on all platforms, unfortunately.) >> Now, if we could squeeze gains out by assembly coding anywhere near those >> Jason Papadopoulos obtained on the SPARC via the ASM route, that would >> be impressive, indeed. > >I don't know what the F90 compilers do - but gcc can produce ASM from >C source files - if I had loads of time, I'd start from that & a >profile listing of where the program spends most of its time, & see >how much I could improve those bits of the code by massaging the ASM >output by the compiler. 
Pretty much what I had in mind for starters, except I'd rather spend my time working on basic algorithmic improvements and code functionality (e.g. adding p-1 factoring.) Any ASM experts out there with some free time? Cheers, Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sat, 21 Aug 1999 21:18:52 EDT From: [EMAIL PROTECTED] Subject: Mersenne: ONLINE.EXE malfunction? Hrm. [STL clicks on ONLINE.EXE] The ONLINE.EXE file is linked to missing export WININET.DLL:InternetGetConnectedState. ["Yeah, yeah." STL clicks OK.] C:\TEMP\ONLINE.EXE A device attached to the system is not functioning. ["Auuugh! You hunk of junk computer!" STL pounds his fist on the desk for emphasis.] Anyone know what caused this error message? I'm not emailing this directly to Mr. Kurowski because it's not working right, and he's probably flooded with successful results. :-S <<- type of Internet connection (dialup modem, LAN, DSL, ISDN, etc.)>> Dialup modem, US Robotics something or other. <<- output of program test run(s)>> A malfunction. :-O <<- if the connection was really open or not>> It was open. <<- ISP connection method & if applicable, version (direct, Windows dialup networking [DUN], AOL 4.0, Compuserve 3.0, etc.)>> I dial up to AOL, version 3.0 for Windows 95, revision 131.62. << - web browsers & versions installed (Netscape 3, IE4.0, IE5, etc.)>> IE3 and Netscape 3, but AOL uses the former. << - Operating system (Win95, Win98, NT 4.0 workstation or server)>> Win95 OSR2. Is this problem software-based? The mention of a DLL makes me think so... Oh well, S.T.L. 
_________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 05:33:57 +0200 (CEST) From: Henrik Olsen <[EMAIL PROTECTED]> Subject: Re: Mersenne: ONLINE.EXE malfunction? On Sat, 21 Aug 1999 [EMAIL PROTECTED] wrote: > Anyone know what caused this error message? I'm not emailing this directly > to Mr. Kurowski because it's not working right, and he's probably flooded > with successful results. :-S He wrote the code, he's the one entitled to be told of bugs. How do you expect him to be able to write code that works on every machine if you won't tell him when it doesn't work? - -- Henrik Olsen, Dawn Solutions I/S URL=http://www.iaeste.dk/~henrik/ Thomas Covenant: I am the savior of The Land. Linden Avery: Can I help? Thomas Covenant: Over my dead body. (dies) (Linden Avery saves The Land.) The Second Chronicles of Thomas Covenant, Book-A-Minute version _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 15:06:32 +1000 From: Simon Burge <[EMAIL PROTECTED]> Subject: Mersenne: Re: DS20 timings [EMAIL PROTECTED] wrote: > Simon Burge writes (Hi, Simon!): Howdy! > The 2.6a source has been moved to the /pub/archived directory, for those > of you who want to see what changes I made to the source to squeeze more > speed out of the code. I'm curious - there seem to be definite patterns in the sections of code (especially when comparing the different length radix subroutines). Is any of this code machine generated, or is it all hand generated? In the past I attempted to get a C version of your code working but didn't quite get there, and then Gord Palameta did a version last year. 
I was thinking of taking a stab at trying 2.6b again... > I look forward to the revised (and hopefully improved) ev6 timings! Here's some more times with a few explanatory notes below:

Platform/per-iteration time (sec)

                500MHz 21264    500MHz 21264    500MHz 21264
                64kB I-cache    64kB I-cache    64kB I-cache
                64kB D-cache    64kB D-cache    64kB D-cache
                4MB L2          4MB L2          4MB L2
                Mlucas2.6a[1]   Mlucas2.6b[2]   Mlucas2.6b[3]   [4]
FFT length:     -------------   -------------   -------------   -------
  128K          0.043           0.040           0.041           0.001
  160K          0.051           0.049           0.053           0.003
  192K          0.062           0.062           0.066           0.004
  224K          0.081           0.081           0.081           0.001
  256K          0.10            0.089           0.093           0.003
  320K          0.12            0.112           0.118           0.005
  384K          0.16            0.14            0.15            0.009
  448K          0.19            0.18            0.18            0.003
  512K          0.21            0.19            0.20            0.004
  640K          0.28            0.25            0.26            0.015
  768K          0.38            0.34            0.36            0.011
  896K          0.40            0.38            0.38            0.006
 1024K          0.46            0.42

[1] - This is the average for the first N iters INCLUDING SETUP TIME. N ranges from 8000 for a 128K FFT to 500 for a 1024K FFT.
[2] - This is the average for the time difference between a 100 iter run and a 200 iter run as described in the README file and completely ignored by me the first time around :-) This is using my binary compiled with: f90 -arch ev6 -tune ev6 -fast -O5 -assume accuracy_sensitive
[3] - As for [2], but using Ernst's Mlucas_2.6b.exe.ev6 binary. In both [2] and [3], the residues matched.
[4] - The speed increase in secs/iter between my binary and Ernst's binary.

Ernst - what compiler options did you use? Simon. _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 15:12:26 +1000 From: Simon Burge <[EMAIL PROTECTED]> Subject: Re: Mersenne: LL Test Program verification data "Brian J. Beesley" wrote: > I've posted on my ftp server a file which may be of interest to > anyone developing or verifying the operation of LL testing programs. 
> > The URL of the file is > ftp://lettuce.edsc.ulst.ac.uk/gimps/PrimeQA/QADATA.TXT > > The format is > exponent,# iterations,residual,program id,max error > > The residual listed is the hexadecimal representation of the low > order 64 bits of the complete residual. The program id is always > "lucdwt"; the max error can be ignored (basically it's there as a > self-check to make sure that a sensible FFT size was used to generate > the data). > > The range of exponents is selected exponents between 8,000 and > 80,000,000. The low 64 bits of the residual after 400 iterations is > listed for every exponent in the file. (After one iteration the > residual should be 000000000000000E for every exponent). For > exponents less than 4,000,000, the residual after 1,000 iterations is > also listed. For exponents greater than 20,000,000, the residual > after 100 iterations is also listed. > > The results were generated on a 533 MHz Alpha 21164LX system; the > program used is a version of lucdwt (included in Richard Crandall's > giantint package), modified to produce output in a suitable format. > > I hope that this data assists anyone developing or modifying LL > testing software to ensure that their code is operating correctly. I've got some patches to Will's mers package to print out results that can be compared with this. 
For example, here's the first few lines of QADATA.TXT:

64511,100,EBA1B280B7C00FF7,lucdwt,0.000107
64511,200,96F428A92412ADA8,lucdwt,0.000122
64511,300,38498B59F3A614B7,lucdwt,0.000122
64511,400,3389BC878321980A,lucdwt,0.000122
64513,100,D24F0966C2E9B3C4,lucdwt,0.000092
64513,200,2376508C726DFB47,lucdwt,0.000092
64513,300,D4BA71A0879202D5,lucdwt,0.000092
64513,400,FF4069B4F7F0BDA4,lucdwt,0.000095

and the start of a test run of MacLucasUNIX running on an UltraSPARC:

M( 64511 )C, 0xeba1b280b7c00ff7, iter = 100, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64511 )C, 0x96f428a92412ada8, iter = 200, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64511 )C, 0x38498b59f3a614b7, iter = 300, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64511 )C, 0x3389bc878321980a, iter = 400, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64513 )C, 0xd24f0966c2e9b3c4, iter = 100, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64513 )C, 0x2376508c726dfb47, iter = 200, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64513 )C, 0xd4ba71a0879202d5, iter = 300, n = 4096, MacLucasUNIX v6.20 Sweeney
M( 64513 )C, 0xff4069b4f7f0bda4, iter = 400, n = 4096, MacLucasUNIX v6.20 Sweeney

The tests are still running - when it is finished I'll write a little perl script that compares the two files and prints out any error messages. Simon. _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 02:24:50 EDT From: [EMAIL PROTECTED] Subject: Mersenne: Re: LL Test Program verification data Brian Beesley writes: >I've posted on my ftp server a file which may be of interest to >anyone developing or verifying the operation of LL testing programs. Mlucas users please note: my code treats the initial residue (4) as "iteration one," rather than the more-common zero. (I guess that makes it obvious I'm a Fortran guy, eh? :) Thus, to compare to, e.g. 
Brian's 400-iteration residue for a given exponent, you'd need to specify 401 iterations using the Mlucas timing test feature. (I'd actually prefer Brian use 10-iteration residues, since then Mlucas would literally "go to eleven," but I realize ten iterations is too few for a decent QA test.) Also, Mlucas requires the exponent to be prime (actually a strong probable pseudoprime), whereas most of the exponents in Brian's QA file are composite. Brian, any chance you could flag the prime exponents in the file? (you could write a small code to do this automatically.) Cheers, Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 02:24:49 EDT From: [EMAIL PROTECTED] Subject: Mersenne: Re: DS20 timings Simon Burge writes: >I'm curious - there seems to be definite patterns in the sections of >code (especially when comparing the different length radix subroutines). >Is any of this code machine generated, or is it all hand generated? All hunt-and-peck hand-generated. Of course many sections of code are very similar, so much was created via cut-paste-modify. That's what I meant about the out-of-place transform strategy being nice from a coding and debugging (and now, speed) perspective. The best place to read about the speedups and view the evolution of the code is the program header. >In the past I attempted to get a C version of your code working but didn't >quite get there, and then Gord Palameta did a version last year. I was >thinking of taking a stab at trying 2.6b again... I agree, it would be nice to have a C version, but as long as there are executables for the major platforms, it's not crucial. I'll have a Linux binary hopefully by middle of next week. 
Since we have Alpha Unix and SGI Irix binaries, the major voids are SPARC and the PowerPC (and its AltiVec-equipped successors). Hopefully when Alex Kruppa returns to a quasi-normal routine (and passes his TU München exams), one of the three f90 compilers he said he'd try on his Ultra will prove to be a decent one. Of course, PPC and SPARCers can (and should) use MacLucasUnix in the meantime. There's also Jason Papadopoulos' potential LL code, but even with lots of encouragement it'll likely take him months to modify his Fermat number code appropriately, and even longer to add non-power-of-2 runlengths. As anyone who has coded such algorithms knows, it's highly nontrivial, especially when you need the code to not just run, but to run fast. (I've been working on Mlucas for nearly three years - of course much of year 1 was spent just learning the basics of FFTs, the LL test, and the DWT.) >Here's some more times with a few explanatory notes below: Thanks! Those look very good - no more than a few percent slower than MacLucasUnix at any power-of-2 runlength, and quite a bit faster at most of the intermediate lengths and at 1024K. >[4] - The speed increase in secs/iter between my binary and Ernst's > binary. Ernst - what compiler options did you use? I didn't use -arch ev6, used -O4 rather than -O5 (-O5 runs slower on my 21064, where I compile) and didn't use -fast (also a tad slower on my ev4). I'll compile using your options (at least for the ev6 executable) sometime in the next few days. The difference is slight, but I know every bit counts - e.g. for my Fermat-DWT code, every 1% speedup saved nearly 2 days of runtime on F24 (on a 250 MHz MIPS R10000). 
Cheers and happy hunting, Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 21:45:31 EDT From: [EMAIL PROTECTED] Subject: Mersenne: K7 vs. x86 There have been some recent posts about the AMD K7 (Athlon), and whether properly tuning the Prime95 code for the K7 could result in better performance than on a similar-clock-rate Pentium. Of course George W. can comment definitively, but in theory the K7 should be capable of delivering rather better performance. Here's my reasoning: 1) The K7 has essentially the same ultra-high-performance bus as the Alpha 21264, which I believe is 128 bits wide and runs at 200MHz, thus is capable of feeding the hungry processor with significantly more data; 2) The K7 has 3 functional units in its FPU, compared to just two for every other high-end microprocessor I am aware of. All high-end processors have a floating adder and multiplier. I'm not sure what the third unit on the K7 does, but I suspect it's either a second floating adder, or perhaps capable of doing a fused multiply-add. In either case, the, um, added add capability could be exploited to speed an FFT. That's because a typical higher-radix FFT implementation is dominated by adds rather than multiplies. For example, a typical complex radix-16 FFT pass in my Mlucas code takes 16 complex data (32 8-byte floats), and including multiplies by "twiddle" factors (FFT sincos data) does 168 FADDs and 88 FMULs on them- that's nearly twice as many adds as multiplies, i.e. the floating multiplier is idle nearly half the time. If one could do two adds per cycle, one could remedy that imbalance and (neglecting the fact that memory traffic is also responsible for many processor stalls) in theory nearly double the speed. 
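Ernst's radix-16 numbers make the imbalance, and the potential gain from a second adder, easy to quantify with a toy bound (ignoring memory traffic and dependencies, as he notes):

```python
# Toy throughput model for one radix-16 FFT pass: 168 FADDs and 88 FMULs.
# A lower bound on cycles is ops / units for the busiest functional unit.
fadds, fmuls = 168, 88

one_adder = max(fadds / 1, fmuls / 1)   # adder-limited: 168 cycles minimum
two_adders = max(fadds / 2, fmuls / 1)  # multiplier now limits: 88 cycles

speedup = one_adder / two_adders  # about 1.9x - "nearly double"
```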
I've always thought the cheapest way to improve performance of a transform-based code (since floating hardware is very expensive) would be to design a floating adder that could take two data x and y and return both the sum and difference, x+y and x-y, in one clock. This would be a much cheaper way to speed an FFT than adding an entire third unit to the FPU, since much of the work in this pair of operations is shared (for instance, the exponent compare and mantissa shift are the same for the sum and difference, i.e. need only be done once.) One would think, e.g. a floating adder and a fused add/multiply unit (in lieu of the pure multiplier) might be capable of giving a similar performance boost, but in practice it appears difficult or impossible to schedule the above operations so as to be doable in <= 100 clocks with such a combination. Of course people don't just run FFTs, which is why fused multiply-add is preferred by hardware designers over fused add-subtract. - -Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Sun, 22 Aug 1999 21:30:21 -0600 From: "Aaron Blosser" <[EMAIL PROTECTED]> Subject: RE: Mersenne: K7 vs. x86 > 2) The K7 has 3 functional units in its FPU, compared to just two for > every other high-end microprocessor I am aware of. All high-end processors > have a floating adder and multiplier. I'm not sure what the third unit on > the K7 does, but I suspect it's either a second floating adder, or perhaps > capable of doing a fused multiply-add. In either case, the, um, added add > capability could be exploited to speed an FFT. That's because a typical > higher-radix FFT implementation is dominated by adds rather than > multiplies. I had discussed the Athlon briefly with George, and his understanding was that Athlon could do an fadd and an fmul in the 2 FPU units at the same time. 
This, according to him, is not that useful unless you could really work hard to make sure both engines are constantly fed with work. I've confirmed this architecture on the Athlon (visit http://www.amd.com/products/cpg/athlon/pdf/fpu_wp.pdf for a nice overview of the Athlon FPU). Sure enough, there are 3 FPU units. One that does fadd (and 3dnow), one doing fmul (and 3dnow) and one doing fstore (loads/stores/etc). I wonder if those benchmarks, like spec_fp, run in such a way that they do mixes of fadds and fmuls, keeping all 3 pipes constantly busy. If so (and I bet it does), no wonder the Athlon kicks butt on the FP benchmarks (~137% faster for a PIII-550 vs. Athlon 550 on specfp_base95). But as we can see from the real world benchmarks, even though the Athlon still does better, clock for clock, than a PIII, the difference isn't as great (only ~107% faster on Winbench 99 FPU Winmark). The benchmark was done with a PIII Xeon with 512KB L2 cache. I wonder if having a Xeon with a 2MB cache (or even 1MB) would have made this any closer since the Winbench test is a lot larger than the specfp tests, and a larger cache would likely have more impact. Hmmm... So, the question I put out is, can the algorithm be tweaked in any way so that fadds and fmuls can be done together, or is it, as George initially suspects, worthless? Having the pipe just for doing fstore is probably where we'd see most of the performance benefits, and according to the AMD supplied chart, the execution time of fmul is less (4=Athlon, 5=PIII), however fadd takes longer on the Athlon (4=Athlon, 3=PIII). If you wanted to do extended double fdivs, the Athlon would definitely kick butt (24=Athlon, 37=PIII), and you see the same thing with the fsqrt (extended) as another example. 
If you were pipelining fmuls, the Athlon could spit them out in 1 clock cycle (after the 4 cycle latency), compared to 2 cycles (after the 5 cycle latency) on the PIII, so it would be REAL important to get lots of yummy pipelined fmuls to the Athlon to really let it strut its stuff. Aaron _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 23 Aug 1999 00:15:14 EDT From: [EMAIL PROTECTED] Subject: Mersenne: How fast can we square? (Was: K7 vs. x86) What I neglected to state explicitly in my last posting is that the lower limit on the number of adds in an FFT also sets a limit on the achievable performance. Let's consider a Mersenne-mod DWT-based squaring with a (real) vector length of 2^20, or just over a million, performed using 64-bit floats on a CPU which can execute one floating add per clock cycle (i.e. most processors). When one includes the FFT twiddle multiplies, a typical number is that we need 3 adds per float per power of 2 in the vector length, i.e. 60 per float to effect a length-2^20 transform. Going to higher radices and highly optimized transforms shaves a little bit off this figure, but mainly has the effect of reducing the number of floating multiplies and allowing one to do lots of operations while data are in registers, i.e. of keeping the processor well fed. We thus estimate 60 FADDs per datum for a forward FFT, and an equal number for the inverse FFT. Throw in some adds for the pointwise square and the carry step, and 135-150 adds per float per iteration sounds reasonable. We thus need to do around 150 million adds per iteration. On a 450 MHz CPU, even if one does the theoretical maximum of one add per clock, each iteration will need a third of a second. 
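Ernst's estimate can be written out as arithmetic (145 adds per float is just a midpoint of his 135-150 range):

```python
# Back-of-envelope operation count for one LL iteration at FFT length 2^20,
# following the text: ~3 adds per float per power of 2 of vector length.
levels = 20
n = 2 ** levels                     # 1,048,576 floats

adds_fwd = 3 * levels               # ~60 FADDs per float, forward FFT
adds_per_float = 2 * adds_fwd + 25  # + inverse FFT, pointwise square, carry

total_adds = n * adds_per_float     # ~152 million adds per iteration
secs = total_adds / 450e6           # ~0.34 sec at one add per 450 MHz clock
```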
That math applies to the IA-64, so we can expect good performance from the code George is presumably writing for that CPU, but not miracles - George tells me his Prime95 code needs about 2/3 of a second to do a length-2^20 squaring on a 450MHz PII, so we can expect at most a factor of 2 speedup on a similar-speed Merced. Looking at my Mlucas code, on a 500MHz Alpha 21264 (also at most one add per clock) it needs 0.4 seconds to do the above squaring, which is not much longer than the estimated minimum time of 0.3 seconds. This tells me that a huge ASM coding effort targeted at the 21264 is not justified, since one can't hope to squeeze much more speed out of the code that way. On the other hand, the code runs 3x slower on an equal-speed 21164, so trying to boost the performance on that architecture by improving the source code and doing some assembly coding should be worth some effort. - -Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 23 Aug 1999 00:31:55 -0400 (EDT) From: Jason Stratos Papadopoulos <[EMAIL PROTECTED]> Subject: Re: Mersenne: K7 vs. x86 On Sun, 22 Aug 1999 [EMAIL PROTECTED] wrote: > For example, a typical complex radix-16 FFT pass in my Mlucas code > takes 16 complex data (32 8-byte floats), and including multiplies by > "twiddle" factors (FFT sincos data) does 168 FADDs and 88 FMULs on them- > that's nearly twice as many adds as multiplies, i.e. the floating multiplier > is idle nearly half the time. If one could do two adds per cycle, one could > remedy that imbalance and (neglecting the fact that memory traffic is also > responsible for many processor stalls) in theory nearly double the speed. I've wondered for a while if the DFT can be re-factorized to have *more* multiplies and fewer total ops. 
From the doodling I've done, though, it seems that the ratio can never be better than 3:2 adds to muls. There was a paper in (I believe) Appl. Math and Comp. from about two years ago that reformulated radix-2,3,4 and 5 FFT butterflies to use multiply-adds wherever possible. The savings can be impressive: a radix-2 FFT butterfly has 4 multiplies and 6 adds, but this boils down to 6 multiply-adds. Can't the RS6000 do two multiply-adds per clock? Or, if you just want to cheapen a complex multiply, turn the standard form xr,xi * yr,yi -> xr*yr-xi*yi, xr*yi+xi*yr into xr,xi * yr,yi -> xr*(yr - xi/xr * yi), xr*(yi + xi/xr * yr) and precompute xr and xi/xr. Presto! 2 multiplies and 2 multiply-adds. > I've always thought the cheapest way to improve performance of a transform- > based code (since floating hardware is very expensive) would be to design > a floating adder that could take two data x and y and return both the sum > and difference, x+y and x-y, in one clock. This would be a much cheaper way to > speed an FFT than adding an entire third unit to the FPU, since much of this > pair of operations is the same (for instance, the exponent compare and > mantissa > shift are the same for the sum and difference, i.e. need only be done once.) Analog Devices' SHARC DSP chips have a fused add-subtract (and a fused multiply-add); a 50MHz SHARC can do an FFT taking only twice as long as a 300MHz PPC 604e (sorry, single precision only). jasonp _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 23 Aug 1999 11:31:37 -0500 From: "Willmore, David" <[EMAIL PROTECTED]> Subject: RE: Mersenne: K7 vs. 
x86 > From: [EMAIL PROTECTED] [SMTP:[EMAIL PROTECTED]] > 1) The K7 has essentially the same ultra-high-performance bus as the > Alpha 21264, which I believe is 128 bits wide and runs at 200MHz, thus > is capable of feeding the hungry processor with significantly more data; > Sorry, system data bus from Athlon to core logic is SDATA[63:0]. It's the signal levels and most of the overall logical design that the K7 takes from the EV6 (21264). Now, AMD has promised server versions of the Athlon and hinted that they might need a different slot connector. This would fit with a wider data bus. Note, the 21264 cannot plug into the 'slot A' connector without losing half of its data bus width. Isn't there a 21264PC or some such? I believe those were half width data bus types of chips. One of them might work in a slot A. Keep in mind that the Athlon bus as it is in slot A is already a match for current generation SDRAM. Of course, DDR-SDRAM will be out soon as will higher clock rate SDRAM. They could always block/interleave the memory, too. Cheers, David _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers ------------------------------ Date: Mon, 23 Aug 1999 20:28:59 +0100 From: "Brian J. Beesley" <[EMAIL PROTECTED]> Subject: RE: Mersenne: K7 vs. x86 On 22 Aug 99, at 21:30, Aaron Blosser wrote: > But as we can see from the real world benchmarks, even though the Athlon > still does better, clock for clock, than a PIII, the difference isn't as > great (only ~107% faster on Winbench 99 FPU Winmark). The benchmark was > done with a PIII Xeon with 512KB L2 cache. This is _still_ remarkable, since the "consumer" Athlons starting to trickle onto the market have 64-bit 100 MHz FSB and 512KB L2 cache, like PII / PIII / Xeon, but run their L2 cache at only 1/3 clock speed (c.f. 
full clock speed for Xeon & 1/2 clock speed for PII/PIII) The high performance Athlons with 128 bit FSB @ 200 MHz & larger, relatively faster L2 cache should be really impressive. Maybe starting to approach what Alpha has been doing for a while ;-) (The critical difference here is that Athlon does run native IA32 code!) > So, the question I put out is, can the algorithm be tweaked in any way so > that fadds and fmuls can be done together, or is it, as George initially > suspects, worthless? Hmm. Is it not possible to use two "threads" (NOT in the OS sense - I mean working on two streams of data, interleaved) & arrange them so that the FMULs for stream 1 happen when stream 2 is doing FADDs on independent data, & vice versa? I think you should be able to keep the pipes pretty full ... though whether you can do a _lot_ better than a decent compiler depends on how smart the instruction prefetch / decode / scheduling in the processor core is. You probably _won't_ be able to achieve 100% utilization of all the pipelines all the time - - I've a sneaky suspicion that George's Pentium code drives the memory bus quite hard already; feeding a faster, more efficient processor from the same bus bandwidth is going to make it more likely that bus saturation, rather than processor throughput capability, will limit the ultimate performance of the system as a whole. > > If you were pipelining fmuls, the Athlon could spit them out in 1 clock > cycle (after the 4 cycle latency), compared to 2 cycles (after the 5 cycle > latency) on the PIII, so it would be REAL important to get lots of yummy > pipelined fmuls to the Athlon to really let it strut its stuff. Am I missing something here? I thought that the _throughput_ was 1 FMUL per clock, but there was a 4 clock period between the instruction entering the execution unit (meaning that it has already had to be prefetched, decoded and the operands made available) and the result of the operation becoming available. 
So, provided there is no delay in fetching instructions, there is capacity in the decoders and there is no stall due to operands being unavailable, you _should_ get a throughput of 1 FMUL per clock (assuming that you aren't also scheduling other instructions which block the multiplier execution unit, or use its pipeline).

Regards
Brian Beesley
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Mon, 23 Aug 1999 16:14:02 -0500
From: "Willmore, David" <[EMAIL PROTECTED]>
Subject: RE: Mersenne: K7 vs. x86

> From: Brian J. Beesley [SMTP:[EMAIL PROTECTED]]
> This is _still_ remarkable, since the "consumer" Athlons starting to
> trickle onto the market have 64-bit 100 MHz FSB and 512KB L2 cache,
> like PII / PIII / Xeon, but run their L2 cache at only 1/3 clock
> speed (c.f. full clock speed for Xeon & 1/2 clock speed for PII/PIII)

Well, they do run the point-to-point bus clock at 100MHz, but they send/receive data on each clock transition, so it's 200 M-transfers/sec. And, I thought the production Athlons were using the 1/2-speed cache -- it was only the pre-production (evaluation) processors which were using the 1/3 speed. Of course, I could easily be wrong.

> The high performance Athlons with 128 bit FSB @ 200 MHz & larger,
> relatively faster L2 cache should be really impressive. Maybe
> starting to approach what Alpha has been doing for a while ;-) (The
> critical difference here is that Athlon does run native IA32 code!)

Yes, 400 M-transfers/second at 16 bytes each is a nice 6.4 GB/s. :) Run the 8M L2 at 1:1 with the processor at 800MHz and get twice that. :) I can't find a line in the document stating the width of the L2 bus, but I would be surprised if it's < 128 bits. 256 for a server version would be nice. 12.8 GB/s to 25.6 GB/s, geez...
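David's bus numbers can be sanity-checked with a little arithmetic. A quick sketch (not from the thread itself): peak bandwidth is just transfers per second times bytes per transfer, where a double-data-rate bus moves data on both clock edges.

```python
# Back-of-the-envelope peak bus bandwidth, per the figures quoted above.
# bandwidth (GB/s) = clock * transfers-per-clock * bus width in bytes

def bandwidth_gb_s(clock_mhz, transfers_per_clock, width_bits):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    transfers_per_sec = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_sec * (width_bits / 8) / 1e9

# Slot A Athlon: 100 MHz clock, data on both edges, 64-bit data bus
print(bandwidth_gb_s(100, 2, 64))    # 1.6 GB/s (200 M-transfers/s x 8 bytes)

# Hypothetical server part: 200 MHz clock, both edges, 128-bit bus
print(bandwidth_gb_s(200, 2, 128))   # 6.4 GB/s (400 M-transfers/s x 16 bytes)
```

Both results match the figures in the message: 200 M-transfers/s on the consumer bus, and 6.4 GB/s for the rumored 128-bit / 200 MHz configuration.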
> > If you were pipelining fmuls, the Athlon could spit them out in 1 clock
> > cycle (after the 4 cycle latency), compared to 2 cycles (after the 5
> > cycle latency) on the PIII, so it would be REAL important to get lots of
> > yummy pipelined fmuls to the Athlon to really let it strut its stuff.
>
> Am I missing something here? I thought that the _throughput_ was 1
> FMUL per clock, but there was a 4 clock period between the
> instruction entering the execution unit (meaning that it has already
> had to be prefetched, decoded and the operands made available) and
> the result of the operation becoming available. So, provided there is
> no delay in fetching instructions, there is capacity in the decoders
> and there is no stall due to operands being unavailable, you _should_
> get a throughput of 1 FMUL per clock (assuming that you aren't also
> scheduling other instructions which block the multiplier execution
> unit, or use its pipeline).

I believe what he's saying is that there is a bit of 'granularity' to the pipe in the PII. It sounds like an FMUL can only enter the pipe every other cycle -- to emerge five cycles later. If so, that's what used to be called 'superpipelined'. Hmmm, no, that would be backwards. Maybe it's 'subpipelined'. Think of it as a 2.5-cycle-long pipe running at half core speed.

This can result from a not-fully-pipelined stage in the middle of the pipe. Say single precision goes: stageA, stageB, stageC, but double precision goes: stageA, stageB, stageB, stageC, stageC. That way, stage B is used for two successive cycles by the same operand. This will force stage A to stall waiting for the following stage to clear. This isn't all that unlikely when the stages are normally sized for single precision.
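The latency-versus-throughput distinction being argued here can be made concrete with a toy model (my sketch, not tied to any real core): for n back-to-back independent operations, the last result appears at roughly latency + (n - 1) * issue_interval cycles, so for long streams the issue interval, not the latency, dominates.

```python
# Toy pipeline model: n independent ops through one execution unit.
# latency        = cycles from issue until the result is available
# issue_interval = cycles between successive issues (1 = fully pipelined)

def cycles(n_ops, latency, issue_interval):
    """Cycle on which the last of n_ops back-to-back results appears."""
    return latency + (n_ops - 1) * issue_interval

# Fully pipelined FMUL, 4-cycle latency (the Athlon figures quoted above):
print(cycles(100, 4, 1))   # 103 cycles -> ~1 result per clock long-term

# 'Subpipelined' unit that accepts an op only every other clock
# (the PIII double-precision model above, 5-cycle latency):
print(cycles(100, 5, 2))   # 203 cycles -> ~1 result per 2 clocks
```

Either way, latency only bites when each result feeds the next operation; with independent data it is hidden by the pipeline, which is exactly why interleaving two independent streams, as Brian suggested earlier, can keep the pipes full.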
Cheers,
David
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

Date: Tue, 24 Aug 1999 00:15:37 -0700
From: Luke Welsh <[EMAIL PROTECTED]>
Subject: Mersenne: Ramanujan on a bad day

via a correspondent in Brasil...

http://www.austms.org.au/Gazette/Nov97/brown.html
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers

------------------------------

End of Mersenne Digest V1 #618
******************************
