At 10:29 PM 3/19/02 +0000, you wrote:

>On Tue, Mar 19, 2002 at 10:03:50PM +0100, Steinar H. Gunderson wrote:
>> Paste from gwnum.c, Prime95 v19:
>> 
>> /* Well.... I implemented the above only to discover I had dreadful */
>> /* performance in pass 1.  How can that be?  The problem is that each  */
>> /* cache line in pass 1 comes from a different 4KB page.  Therefore, */
>> /* pass 1 accessed 128 different pages.  This is a problem because the */
>> /* Pentium chip has only 64 TLBs (translation lookaside buffers) to map */
>> /* logical page addresses into physical addresses.  So we need to shuffle */
>> /* the data further so that pass 1 data is on fewer pages while */
>> /* pass 2 data is spread over more pages. */
>> 
>> So, it might be that due to TLB thrashing, George would have to
>> choose a less efficient memory layout to avoid them, and thus get
>> lower speed overall.
>
>I think 2MB pages will be a win for any memory layout using < 128 MB
>of RAM (64 TLB entries * 2 MB pages).


On x86 machines, 2 MB pages only apply in PAE mode (i.e. the weird
addressing mode that uses 36-bit physical addresses). When not in PAE
mode, your choices for page size are 4 kB and 4 MB.
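
Just to make the reach arithmetic concrete, here's a throwaway C snippet
(the 64-entry count is from George's comment above; real parts vary):

#include <stdio.h>

int main(void)
{
    const unsigned long entries = 64;   /* TLB entries, per George */
    const unsigned long sizes[] = { 4UL << 10, 2UL << 20, 4UL << 20 };
    const char *names[] = { "4 kB", "2 MB (PAE)", "4 MB" };

    /* TLB "reach" is just entries * page size. */
    for (int i = 0; i < 3; i++)
        printf("%-10s pages: reach = %7lu kB\n",
               names[i], entries * sizes[i] >> 10);
    return 0;
}

That gives 256 kB of reach with 4 kB pages versus 128 MB with 2 MB
pages, which is where the "< 128 MB of RAM" figure above comes from.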

I think the linux kernel already uses large pages for *kernel* code and
data. Since these are never supposed to be swapped to disk, large pages
are okay for this purpose; you should essentially never suffer a TLB miss
in kernel code.

Another advantage to using large pages is that on x86 machines there
is a *separate* lookaside buffer for them (8 entries, on Pentiums, I
think) in addition to the normal TLB, so you actually give yourself more
entries when using both sizes simultaneously. In fact, P6-family
processors (Pentium Pro and later) have a "global page" bit in a system
register that, if set, forces the large TLB entries *never* to be flushed
on a context switch.
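
If you want to check what a particular chip supports, the CPUID feature
flags will tell you. A quick sketch (x86 with gcc inline asm; the bit
positions are from Intel's manuals):

#include <stdio.h>

int main(void)
{
    unsigned int eax = 1, ebx, ecx, edx;

    /* CPUID leaf 1 returns the feature flags in EDX. */
    __asm__ volatile("cpuid"
                     : "+a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx));

    printf("PSE (4 MB pages):         %s\n", (edx & (1u << 3))  ? "yes" : "no");
    printf("PGE (global TLB entries): %s\n", (edx & (1u << 13)) ? "yes" : "no");
    return 0;
}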

Other machines are much more flexible: PA-RISC machines have a TLB where
every entry can map a configurable number of pages, from 8 kB all the way
to 16 MB in powers of 2. There's a paper somewhere showing how HP-UX
manages TLB entry sizes dynamically, detecting on the fly when processes
need bigger pages. They claim huge speedups for all sorts of
applications.

For Prime95, I think a screwy memory layout is already required to
avoid cache thrashing. With large pages, the big advantage would be that
a larger FFT radix could be used in the earlier passes of an FFT; if that
reduced the total number of FFT passes from three to two for a large
range of exponents, it would likely be a *big* win. On the other hand,
for the P4 I dimly remember that Prime95 is compute-bound anyway, so it
may not make much of a difference at the high end.
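
To see why the pass count matters so much, here's a back-of-the-envelope
model in C (my own toy radices and FFT size, not George's actual numbers):

#include <math.h>
#include <stdio.h>

/* If one pass can apply a radix of at most R points while its working
   set stays inside the cache/TLB, an N-point FFT needs roughly
   ceil(log(N)/log(R)) passes over the whole dataset.  The epsilon
   guards against ceil(2.0000000001) from floating-point rounding. */
static int passes(double n, double radix)
{
    return (int)ceil(log(n) / log(radix) - 1e-9);
}

int main(void)
{
    double n = 4.0 * 1024 * 1024;   /* a 4M-point FFT, say */

    printf("radix  256: %d passes\n", passes(n, 256.0));    /* 3 */
    printf("radix 2048: %d passes\n", passes(n, 2048.0));   /* 2 */
    return 0;
}

Each pass is a full sweep over the whole dataset in main memory, so
dropping from three passes to two cuts the memory traffic by a third.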

Regarding high-performance memory allocation, I've been developing patches
for the linux kernel that implement page coloring, so that allocated pages
are spread evenly over the cache and you get higher bandwidth for large
working sets. So far I've only tested the patch on an Alpha, but it's
architecture-independent. If anyone running a linux machine would like to
try the patch out and see if programs like mprime run faster, let me know
via email. I have 2.2 and 2.4 kernel versions.
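
For anyone curious what page coloring actually means, here's a toy
illustration of the idea (the cache size is made up, and the real patch
does this inside the kernel's free-page lists, not in user code):

#include <stdio.h>

#define PAGE_SHIFT  12                          /* 4 kB pages */
#define CACHE_BYTES (2UL << 20)                 /* pretend 2 MB physically-indexed cache */
#define NUM_COLORS  (CACHE_BYTES >> PAGE_SHIFT) /* 512 page-sized cache bins */

/* A page's color is which cache bin its physical address maps to. */
static unsigned long page_color(unsigned long pfn)
{
    return pfn & (NUM_COLORS - 1);
}

int main(void)
{
    unsigned long next_color = 0, pfn = 0;

    /* A coloring allocator cycles through the colors, so consecutive
       allocations land in consecutive cache bins instead of piling up
       in a few of them; here we just fabricate suitable frame numbers
       to show the pattern. */
    for (int i = 0; i < 8; i++) {
        unsigned long want = next_color++ & (NUM_COLORS - 1);
        while (page_color(pfn) != want)
            pfn++;
        printf("allocation %d -> pfn %lu (color %lu)\n", i, pfn, want);
    }
    return 0;
}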

jasonp