> From [EMAIL PROTECTED] Wed Nov 10 12:42:00 1999
> 
> Bill replied:
> 
> >The executable sizes are quite different:-
> >
> >113216 bytes for MLU
> >340048 bytes for Mlucas
> 
> That is *exactly* the kind of size difference one might expect to cause
> an appreciable performance difference on a 256KB cache machine. I'll
> bet the runtime profiling is designed to strip out unused code sections
> from the executable image, and thus reduce its size. If you still can't
> get -xprofile to compile decently fast on Mlucas (a problem you noted
> earlier), there's a manual way to test this hypothesis, namely:
> 
> 1) Pick an FFT length for testing. Look at the combination of FFT
> radices used for that N by Mlucas in mers_mod_square. (Example: for
> N = 224K = 224*1024 = 229376, mers_mod_square lists a set of complex
> radices (7,8,8,16,16), whose product is 229376/2.)
> 
> 2) Comment out all subroutine calls in mers_mod_square which are not
> to the routines for the radices in (1). E.g. for the 224K example,
> in the select case(radix(1)) blocks, comment out all calls except
> the ones to radix7_dif_pass1, radix7_ditN_cy_dif1 and radix7_dit_pass1.
> 
> 3) Recompile and compare the size of the executable to that of the
> full executable. If it's not substantially smaller, you may have to
> physically remove the commented-out subroutines from the program file,
> then recompile.
> 
> 4) Once your .exe is reasonably small, run some timings. Note that the
> code compiled for 224K above would also work for any N which uses a
> combination of radices of the form (7,{any combination of 4,8 or 16},16)
> (the final radix must always be a 16), i.e. also for 112K and 448K
> (assuming you're using radices (7,8,16,16,16), not (14,4,16,16,16)
> for the latter.)
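
The radix bookkeeping in steps 1 and 4 can be checked mechanically. The
following is only a minimal sketch: the function names are made up for
illustration and are not part of Mlucas. It tests that a set of complex
radices multiplies out to N/2, and that the set has the form
(7, {4, 8 or 16}..., 16) that a radix-7-only build would still handle.

```python
from math import prod

K = 1024
ALLOWED_MIDDLE = {4, 8, 16}  # intermediate radices named in step 4


def radices_fit_length(n, radices):
    """True if the complex radices multiply out to n/2 (step 1)."""
    return n % 2 == 0 and prod(radices) == n // 2


def covered_by_radix7_build(radices):
    """True if the set has the form (7, {4, 8 or 16}..., 16) (step 4)."""
    first, *middle, last = radices
    return first == 7 and last == 16 and all(r in ALLOWED_MIDDLE for r in middle)


if __name__ == "__main__":
    for n, radices in [
        (224 * K, (7, 8, 8, 16, 16)),
        (448 * K, (7, 8, 16, 16, 16)),
        (448 * K, (14, 4, 16, 16, 16)),
    ]:
        print(n, radices,
              radices_fit_length(n, radices),
              covered_by_radix7_build(radices))
```

Both 448K sets multiply out correctly, but only the one starting with 7
would be served by an executable stripped down to the radix-7 routines.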
> 
> If this kind of thing does prove helpful (especially on small-cache
> and/or bandwidth-limited systems), once we have a Unix PrimeNet interface
> to automate execution, it will be relatively easy to replace the
> current single Mlucas executable with a set of smaller ones,
> each handling a set of FFT lengths that share the same initial
> radix, e.g. radix(1) = 3,5,6,7,8,10,14,16, the ones currently
> supported by Mlucas.
> 
> The other thing these considerations imply is that compiling in
> 64-bit mode (due to the large .exe that results) may actually be
> counterproductive in many instances, unless the code makes heavy
> use of some 64-bit opcodes which are not supported in 32-bit mode.

Ernst,

The picture gets muddier. I built 6 different executables and timed
them on my own Ultra-5: 270MHz CPU, 256KB L2 cache, and
128MB RAM. Five of them were built with profiling. I ran the first one
with the 512K and 640K FFTs before recompiling; all the others were run
with the 224K FFT and then recompiled. The compiler options used were:-

options1 = -fast -xO5 -xsafe=mem -xprefetch -xtarget=native 
-xarch=v8plusa -xchip=ultra2i -xprofile=collect:Mlucas

options2 = -fast -xtarget=native -xarch=v8plusa -xprofile=collect:Mlucas

On recompiling, collect is changed to use. The feedback files
are removed between compiles with different options.

options3 = -fast -xtarget=native -xarch=v8plusa
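
The two-pass profiling build can be sketched as follows. This only
assembles the command strings and does not run anything. The compiler
name "f90", the source file name "Mlucas.f90" and the helper names are
assumptions made for illustration; the option strings are the ones
listed above.

```python
OPTIONS1 = ("-fast -xO5 -xsafe=mem -xprefetch -xtarget=native "
            "-xarch=v8plusa -xchip=ultra2i -xprofile=collect:Mlucas")
OPTIONS2 = "-fast -xtarget=native -xarch=v8plusa -xprofile=collect:Mlucas"


def second_pass_options(first_pass):
    """Swap the profiling mode from collect to use for the recompile."""
    return first_pass.replace("-xprofile=collect:", "-xprofile=use:")


def build_command(options, compiler="f90", source="Mlucas.f90"):
    """Assemble a compile line; compiler and source are hypothetical."""
    return f"{compiler} {options} -o Mlucas {source}"


if __name__ == "__main__":
    for opts in (OPTIONS1, OPTIONS2):
        print(build_command(opts))                       # pass 1: collect
        print("(run Mlucas on the chosen FFT length)")
        print(build_command(second_pass_options(opts)))  # pass 2: use
        print("(remove feedback files before the next option set)")
```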

1) General purpose Mlucas compiled in 64-bit mode with options1, v9a used in
   place of v8plusa.
2) Code commented out as per above instructions, 32-bit mode, with options1.
3) General purpose Mlucas, 32-bit mode, with options1 plus -xspace.
4) Unused code deleted from source file, 32-bit mode, with options1.
5) General purpose Mlucas, 32-bit mode, with options2.
6) General purpose Mlucas, options3.

Executable no.    Size (bytes)   Secs/iter on 224K FFT
1                 416792         0.291
2                 349704         0.265
3                 314416         0.304
4                 215480         0.281
5                 348808         0.276
6                 340064         0.327
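
To make the size/speed relationship in the table easier to see, here is
a small sketch that ranks the six executables by time per iteration and
picks out the smallest one. The figures are copied from the table; the
variable and function names are made up for illustration.

```python
# executable number -> (size in bytes, secs/iter on the 224K FFT)
RESULTS = {
    1: (416792, 0.291),
    2: (349704, 0.265),
    3: (314416, 0.304),
    4: (215480, 0.281),
    5: (348808, 0.276),
    6: (340064, 0.327),
}


def ranked_by_speed(results):
    """Executable numbers ordered from fastest to slowest."""
    return sorted(results, key=lambda k: results[k][1])


def smallest(results):
    """Executable number with the smallest file size."""
    return min(results, key=lambda k: results[k][0])


if __name__ == "__main__":
    baseline = RESULTS[6][1]  # no profiling feedback
    for k in ranked_by_speed(RESULTS):
        size, secs = RESULTS[k]
        gain = 100.0 * (baseline - secs) / baseline
        print(f"({k}) {size:7d} bytes  {secs:.3f} s/iter  "
              f"{gain:4.1f}% faster than (6)")
    print("smallest executable:", smallest(RESULTS))
```

The smallest executable, (4), comes third in the ranking, so size alone
does not predict speed in these runs.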

The compiler manual says about the -fast option "This option provides
close to the maximum performance for many realistic applications".
It looks like they're right about that, provided you use profiling
and recompile: only (2) ran faster than (5), which adds little to -fast.

With -fast, the -xtarget and -xchip options are already set, so they're redundant.

For comparison, MLU does 0.276 secs/iter on a 256K FFT.

I also ran my executable (1) on a 512K FFT and it did 0.70 secs/iter.
That's a big gain on the generic executable's 0.78 secs/iter I reported
earlier, but still a long way from MLU's 0.58 secs/iter.

To get the profiling to work I added a lot more swap space. As you may
recall, the compiler was running out of address space and dying.
It takes close to 2 hours to build an executable with profiling,
run it, then recompile using the results of the profiling. The
compiler grows to take over 240MB of address space while optimizing,
and the disk rattles continuously as the system pages.

Bill Rea, Information Technology Services, University of Canterbury  \_ 
E-Mail b dot rea at its dot canterbury dot ac dot nz                 </   New 
Phone   64-3-364-2331, Fax     64-3-364-2332                        /)  Zealand 
Unix Systems Administrator                                         (/' 
 


