On Nov 11, 2010, at 7:15 PM, Jed Brown wrote:

> On Fri, Nov 12, 2010 at 02:03, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > I mean it's easy to tell a thread to do something, but I was not aware
> > that pthreads had nice support for telling all threads to do something
> > at the same time. On a multicore, you want vector instructions,
>
> Why do you want vector instructions on multicore? Since each core has a
> full instruction stream, what do you get by vectorization?
>
> I agree with Barry that massively parallel vector instructions are not
> the way to go for multi-core. Each core has a vector unit (16 bytes
> today, 32 bytes next year (AVX), 64 or 128 likely in a few years,
> depending on who you believe) which gets you the fine-grained
> parallelism.
>
> The main problem with OpenMP across multiple cores is that you don't get
> any control over data locality. In reality, especially on a NUMA system
> (every multi-socket system worth discussing now is NUMA), the location
> of the physical pages is of critical importance. You can easily see a
> performance hit of more than a factor of 3 on a quad-core system due to
> physical pages getting mis-mapped. This is already an issue with
> separate processes using affinity, but only if the OS is sloppy or the
> sysadmins are incompetent (I ran into this when a process was leaving
> some stale ramdisk lying around).
>
> But it is a way bigger deal for OpenMP, where you have no control over
> how the threads get mapped, yet it is absolutely critical that every
> time you touch some memory, you use a thread bound to the same NUMA node
> (=socket, usually) as the thread that faulted it (not the thread that
> allocated it; that doesn't matter, even if it's statically allocated).
> With pthreads, you get a more flexible programming model, and you can
> organize your memory so that almost all accesses are local. In exchange
> for this more explicit control over memory locality (and a generally
> more flexible programming model), you get some added complexity, and it
> can't just be "annotated" into an existing code to "parallelize" it.
> Projecting on someone I've never met, this is likely the primary issue
> Bill Gropp has with OpenMP, and I think it is entirely valid.
This is my understanding of what Bill told me. Better explained than I
could.

   Barry

> I don't have experience using CUDA to generate CPU code, and I don't
> know how it performs. An obvious difference relative to the GPU is that
> the CPU can perform much less structured computation without a
> performance hit. I don't know if CUDA/OpenCL for CPU allows you to take
> advantage of this.
>
> Jed
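
To make the fine-grained parallelism Jed attributes to the per-core
vector unit concrete: the 16-byte unit he cites is SSE on 2010-era x86,
two doubles per register (32-byte AVX doubles that to four). Below is a
minimal sketch using SSE2 intrinsics; it assumes an SSE2-capable x86
compiler and 16-byte-aligned arrays, and the function and variable names
are illustrative, not anything from PETSc.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* z = x + y using the 16-byte vector unit, two doubles per
       instruction.  Assumes x, y, z are 16-byte aligned. */
    void vec_add(double *z, const double *x, const double *y, int n)
    {
      int i;
      for (i = 0; i + 2 <= n; i += 2) {
        __m128d xv = _mm_load_pd(&x[i]);
        __m128d yv = _mm_load_pd(&y[i]);
        _mm_store_pd(&z[i], _mm_add_pd(xv, yv));
      }
      for (; i < n; i++)
        z[i] = x[i] + y[i];              /* scalar remainder */
    }

A compiler will often auto-vectorize the equivalent scalar loop, which
is the point being made above: the fine-grained data parallelism lives
inside each core's instruction stream, without GPU-style massive
vectorization across cores.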

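Jed's first-touch discipline (pages end up on the NUMA node of the
thread that faults them, not the thread that malloc'd them) can also be
sketched with pthreads. The sketch below assumes Linux and the
GNU-specific pthread_setaffinity_np(); which core numbers share a socket
varies by machine, so the 0/1 mapping and all names here are
placeholders, not a recipe.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>

    typedef struct { int core; double *buf; size_t n; } Work;

    static void *worker(void *arg)
    {
      Work     *w = (Work *)arg;
      cpu_set_t set;
      size_t    i;

      CPU_ZERO(&set);
      CPU_SET(w->core, &set);    /* pin this thread before touching memory */
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

      for (i = 0; i < w->n; i++) /* first touch: pages fault onto the */
        w->buf[i] = 0.0;         /* NUMA node this thread is pinned to */

      /* ... later accesses to w->buf by this thread stay NUMA-local ... */
      return NULL;
    }

    int main(void)
    {
      size_t    n = 1 << 20;
      double   *buf = malloc(2 * n * sizeof(double)); /* malloc alone places nothing */
      Work      w[2] = {{0, buf, n}, {1, buf + n, n}};
      pthread_t t[2];
      int       i;

      for (i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &w[i]);
      for (i = 0; i < 2; i++) pthread_join(t[i], NULL);
      free(buf);
      return 0;
    }

With OpenMP the same effect requires trusting that the thread which
initializes a given block is reliably the one that later uses it; that
lack of control over the thread-to-memory mapping is exactly the
complaint in the message above.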