At 10:29 PM 3/19/02 +0000, you wrote:
>On Tue, Mar 19, 2002 at 10:03:50PM +0100, Steinar H. Gunderson wrote:
>> Paste from gwnum.c, Prime95 v19:
>>
>> /* Well.... I implemented the above only to discover I had dreadful */
>> /* performance in pass 1. How can that be? The problem is that each */
>> /* cache line in pass 1 comes from a different 4KB page. Therefore, */
>> /* pass 1 accessed 128 different pages. This is a problem because the */
>> /* Pentium chip has only 64 TLBs (translation lookaside buffers) to map */
>> /* logical page addresses into physical addresses. So we need to shuffle */
>> /* the data further so that pass 1 data is on fewer pages while */
>> /* pass 2 data is spread over more pages. */
>>
>> So, it might be that due to TLB thrashing, George would have to
>> choose a less efficient memory layout to avoid them, and thus get
>> lower speed overall.
>
>I think 2MB pages will be a win for any memory layout using < 128 MB
>of RAM (64 TLB entries * 2 MB pages).
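To make the quoted comment concrete, here is a toy access pattern, my
own sketch rather than anything from gwnum.c, that reads one cache
line from each of 128 different 4KB pages. With only 64 data-TLB
entries, essentially every read in the inner loop takes a TLB miss:

#include <stdlib.h>

#define PAGE_BYTES 4096
#define NPAGES     128
#define STRIDE     (PAGE_BYTES / sizeof(double))   /* 512 doubles */

int main(void)
{
    double *x = calloc(NPAGES * STRIDE, sizeof(double));
    double sum = 0.0;
    int i, iter;

    if (x == NULL)
        return 1;
    /* 128 pages touched per sweep, but the TLB maps only 64 of them */
    for (iter = 0; iter < 100000; iter++)
        for (i = 0; i < NPAGES; i++)
            sum += x[i * STRIDE];
    free(x);
    return sum != 0.0;   /* keep the loops from being optimized away */
}

With 2MB pages the same 512KB of data would fit under a single TLB
entry, and the misses would disappear.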
On x86 machines, 2MB pages only apply to PAE mode (i.e. the weird
addressing mode that uses 36 address bits). When not in PAE mode, your
choices for page size are 4KB and 4MB. I think the linux kernel already
uses large pages for *kernel* code and data. Since these are never
supposed to be swapped to disk, that's okay for this purpose; you'll
never suffer a TLB miss in kernel code. (A user-space sketch of asking
for large pages is in the P.S. below.)

Another advantage of using large pages is that on x86 machines there is
a *separate* large-page lookaside buffer (8-entry on Pentiums, I think)
in addition to the normal TLB, so you actually give yourself more
entries when using both sizes simultaneously. In fact, P6-class
processors and up have a "global" bit for page-table entries, enabled
via a system register, that keeps those TLB entries, including the
kernel's large-page mappings, from ever being flushed on a context
switch.

Other machines are much more flexible: PA-RISC machines have a TLB
where every entry can map a configurable number of pages, from 8KB all
the way to 16MB in powers of 2. There's a paper somewhere that showed
how HP-UX sizes TLB entries dynamically, detecting on the fly when
processes need bigger pages. They claim huge speedups for all sorts of
applications.

For Prime95, I think a screwy memory layout is already required to
avoid cache thrashing. With large pages, the big advantage would be
that a larger FFT radix can be used in the earlier passes of an FFT;
if that can reduce the total number of FFT passes from three to two
for a large range of exponents, it would likely be a *big* win. On the
other hand, for the P4 I dimly remember that Prime95 is compute bound
anyway, so it may not make that big a difference at the high end.

Regarding high-performance memory allocation, I've been developing
patches for the linux kernel that implement page coloring, so that
allocated pages are spread evenly over the cache and you get higher
bandwidth for larger working sets (the second sketch in the P.S. shows
the color computation). So far I've only tested the patch on an Alpha,
but it's architecture-independent. If anyone running a linux machine
would like to try the patch out and see whether programs like mprime
run faster, let me know via email. I have 2.2 and 2.4 kernel versions.

jasonp
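P.S. Two toy sketches to make the above concrete. Both are mine and
purely illustrative; neither is code from the kernel, the patch, or
Prime95.

First, requesting large pages from user space. The MAP_HUGETLB mmap
flag belongs to kernels much newer than the 2.2/2.4 trees mentioned
above, so treat this as a sketch of what such an interface looks like,
not something those kernels will run:

#include <stdio.h>
#include <sys/mman.h>

#define LEN (32UL * 1024 * 1024)    /* 32MB working set */

int main(void)
{
    /* ask for large pages explicitly; fails if none are configured */
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* with 2MB pages this 32MB region needs 16 TLB entries,
       instead of the 8192 that 4KB pages would need */
    munmap(p, LEN);
    return 0;
}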
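Second, the color computation at the heart of a page-coloring
allocator. The constants describe a hypothetical direct-mapped 2MB
physically-indexed cache with 8KB pages (Alpha-flavored numbers, just
for illustration); the names are mine, not the patch's:

#define CACHE_BYTES (2UL * 1024 * 1024)   /* cache size      */
#define CACHE_ASSOC 1                     /* direct-mapped   */
#define PAGE_BYTES  8192                  /* Alpha page size */

/* number of distinct page colors: pages per way of the cache */
#define NCOLORS (CACHE_BYTES / (CACHE_ASSOC * PAGE_BYTES))  /* 256 */

/* physical pages whose frame numbers share a color land on the
   same cache sets and so evict each other */
static unsigned long page_color(unsigned long pfn)
{
    return pfn % NCOLORS;
}

A plain allocator is free to hand a program physical pages that all
share a few colors, which is why two runs of the same program can see
very different memory bandwidth; cycling the color handed out on
successive allocations spreads a big array evenly over the cache and
removes that variance.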