>Tony Reix tried to reply but was blocked. His reply follows:
Hello,

Here is some information about experiments I've made with GLucas (PThreads) on Itanium2 and PowerPC multi-CPU machines. I've also discussed MLucas with Ernst Mayer (but I never found the time to play very long with his code ...).

I've discovered that choices made by Guillermo and Ernst prevent their programs from scaling on large (8, 16 and more CPUs) machines.

It would be sad to have Prime95 usable only on small multi-CPU machines. In 5 years from now, we will probably buy 16-core machines for less than $1000.

From a programmer's point of view, multi-core or multi-CPU machines seem purely SMP (the time to access memory is the same, whatever the number of cores, and whatever core accesses whatever bank of memory).

In fact, all SMP machines are NUMA based, and memory banks are often placed close to some cores/CPUs and thus far or very far from other cores/CPUs. This is called the NUMA factor (Non Uniform Memory Access). On large machines, there are sometimes several NUMA factors.

My experiments showed that the NUMA factor on a 16x PowerPC IBM machine was much lower than on our old 16x Itanium2 Bull (I haven't played with the new ones ...).

On current dual-core PCs, there is probably no NUMA factor. But on 4-core machines built with recent Intel Woodcrest CPUs, I've seen that performance decreases when 4 Prime95 programs are running compared to only one, and the decrease grows larger with larger FFTs. Colleagues saw the same kind of scalability problem with other HPC C or Fortran programs.

With NUMA machines (and large multi-CPU machines are NUMA), I think that one reason why GLucas is not scalable is that memory is allocated BEFORE the threads are created. Thus the memory is allocated on a bank of memory close to the first thread, and further threads needing to read or write a part of the FFT data have to cross one or several buses.
This implies that some threads spend much more time waiting when asking for data.

When I ran the verification of M44 on our old 16x Itanium2 machine, which had one big NUMA factor, I saw that the 16 processors were used only 75%. Not so good.
Three possible reasons for that:
a) the NUMA factor (threads are waiting for memory).
b) a bottleneck: data chewed by the threads must be assembled by one thread at each iteration. Using our PTT visualizer of threads, I've seen this.
c) some threads are late, and thus the thread assembling the work of the different threads must wait.
Probably the 25% idle time was due to a mix of a), b) and c).

One can imagine that the Operating System could move data to memory banks close to the thread which is using it. Colleagues at Bull worked on something like that. But, when several threads are using the same range of memory, that does not work.

Lessons:
a) Chunks of memory should be allocated by the threads that use them most of the time, and not as one big array.
b) The single-thread bottleneck must be reduced or ... parallelized.

I know very little about programming an FFT. But I guess they have been built the easy way: you have one big array allocated at the beginning. Parallelizing this FFT with threads means that each thread has to work on a part of the array. Here is the problem. I do not know if it is possible, but one should try to allocate different chunks of memory, each allocated by the thread which uses it. When some work must be done on all the chunks of data seen as one big array, one could consider copying the data into another big array, or parallelizing this task (easier to say than to do; I know nearly nothing about FFTs!).

Last problem, and George talked about that: if one or several threads are late, then the main thread waits for them and thus idle time appears. I think this appears when CPUs are used from time to time for other small and regular tasks (crons, daemons, ...) or because the O.S.
moves threads from one CPU to another (it should not, but it often makes mistakes, you know!). This can be fixed by binding each FFT thread to a CPU. Thus, on an 8-core machine, it would be nice to bind Prime95 to 6 cores and leave 2 cores free for other tasks and the OS. (Does "nice" create problems?) So Prime95 threads would stay on the same CPUs, and the FFT would be slowed down only when the end-user launches so many higher-priority tasks that they require more than 2 CPUs. And this would require the FFT program to divide the work in non-powers-of-2.

Hope it helps,

_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime
