>Tony Reix tried to reply but was blocked. His reply follows:
Hello,

Here is some information about experiments I've made with GLucas (PThreads) on Itanium2 and PowerPC multi-CPU machines. I've also discussed MLucas with Ernst Mayer (but I never found the time to play very long with his code ...).

I've discovered that choices made by Guillermo and Ernst prevent their programs from scaling on large (8, 16 and more CPUs) machines.

It would be sad to have Prime95 usable only on small multi-CPU machines. In 5 years from now, we will probably buy 16-core machines for less than $1000.

From a programmer's point of view, multi-core or multi-CPU machines seem purely SMP (the time to access memory is the same, whatever the number of cores, and whatever core accesses whatever bank of memory).

In fact, all SMP machines are NUMA based, and memory banks are often placed close to some cores/CPUs and thus far or very far from other cores/CPUs. This is called the NUMA factor (Non Uniform Memory Access). On large machines, there are sometimes several NUMA factors.

My experiments showed that the NUMA factor on a 16x PowerPC IBM machine was much lower than on our old 16x Itanium2 Bull (I haven't played with the new ones ...).

On current dual-core PCs, there is probably no NUMA factor. But on 4-core machines built with recent Intel Woodcrest CPUs, I've seen that performance decreases when 4 Prime95 programs are running compared to only one, and the decrease grows larger with larger FFTs. Colleagues saw the same kind of scalability problem with other HPC C or Fortran programs.

With NUMA machines (and large multi-CPU machines are NUMA), I think that one reason why GLucas is not scalable is that memory is allocated BEFORE the threads are created. Thus the memory is allocated on a bank of memory close to the first thread, and further threads needing to read or write a part of the FFT data have to cross one or several buses.
This implies that some threads spend much more time waiting when asking for data.

When I ran the verification of M44 on our old 16x Itanium2 machine, which had one big NUMA factor, I saw that the 16 processors were used only 75%. Not so good.
Three possible reasons for that:
a) the NUMA factor (threads are waiting for memory).
b) a bottleneck: data chewed by the threads must be assembled by one thread at each iteration. Using our PTT visualizer of threads, I've seen this.
c) some threads are late, and thus the thread assembling the work of the different threads must wait.
Probably the 25% idle time was due to a mix of a), b) and c).

One can imagine that the Operating System could move data to memory banks close to the thread which is using it. Colleagues at Bull worked on something like that. But, when several threads are using the same range of memory, that does not work.

Lessons:
a) Chunks of memory should be allocated by the threads that use them most of the time, and not as one big array.
b) The single-thread bottleneck must be reduced or ... parallelized.

I know very little about programming an FFT. But I guess they have been built the easy way: you have one big array allocated at the beginning. Parallelizing this FFT with threads means that each thread has to work on a part of the array. Here is the problem. I do not know if it is possible, but one should try to allocate different chunks of memory, each allocated by the thread which uses it. When some work must be done on all the chunks of data seen as one big array, one could consider copying the data into another big array, or parallelizing this task (easier to say than to do; I know nearly nothing about FFTs!).

Last problem, and George talked about that: if one or several threads are late, then the main thread waits for them and thus idle time appears. I think this appears when CPUs are used from time to time for other small and regular tasks (crons, daemons, ...) or because the O.S.
moves threads from one CPU to another (it should not, but it often makes mistakes, you know!). This can be fixed by binding each FFT thread to a CPU. Thus, on an 8-core machine, it would be nice to bind Prime95 to 6 cores and leave 2 cores free for other tasks and the OS. (Does "nice" create problems?) So Prime95 threads would stay on the same CPUs, and the FFT would be slowed down only when the end-user launches so many higher-priority tasks that they require more than 2 CPUs. And this would require the FFT program to divide the work in non-powers-of-2.

Hope it helps,

_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime
