On Nov 11, 2010, at 7:15 PM, Jed Brown wrote:

> On Fri, Nov 12, 2010 at 02:03, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > I mean it's easy to tell a thread to do something, but I was not aware that 
> > pthreads had nice support
> > for telling all threads to do something at the same time. On a multicore, 
> > you want vector instructions,
> 
>   Why do you want vector instructions on multicore? Since each core has a 
> full instruction stream what do you get by vectorization?
> 
> I agree with Barry that massively parallel vector instructions are not the 
> way to go for multi-core.  Each core has a vector unit (16 bytes today, 32 
> bytes next year (AVX), 64 or 128 likely in a few years, depending on who you 
> believe) which gets you the fine-grained parallelism.
> 
> The main problem with OpenMP across multiple cores is that you don't get any 
> control over data locality.  In reality, especially on a NUMA system (every 
> multi-socket system worth discussing now is NUMA), the location of the 
> physical pages is of critical importance.  You can easily see a factor of 
> more than 3 performance hit on a quad-core system due to physical pages 
> getting mis-mapped.  This is already an issue with separate processes using 
> affinity, but only if the OS is sloppy or the sysadmins are incompetent (I 
> ran into this when a process was leaving some stale ramdisk lying around).
> 
> But it is a way bigger deal for OpenMP where you have no control over how the 
> threads get mapped, but it is absolutely critical that every time you touch 
> some memory, you use a thread bound to the same NUMA node (=socket, usually) 
> as the thread that faulted it (not allocated, that doesn't matter, even if 
> it's statically allocated).  With pthreads, you get a more flexible 
> programming model, and you can organize your memory so that almost all 
> accesses are local.  In exchange for this more explicit control over memory 
> locality (and generally more flexible programming model), you get some added 
> complexity, and it can't just be "annotated" into an existing code to 
> "parallelize" it.  Projecting on someone I've never met, this is likely the 
> primary issue Bill Gropp has with OpenMP, and I think it is entirely valid.

   This is my understanding of what Bill told me. Better explained than I could have.

   Barry

> 
> I don't have experience using CUDA to generate CPU code, and I don't know how 
> it performs.  An obvious difference relative to the GPU is that the CPU can 
> perform much less structured computation without a performance hit.  I don't 
> know if CUDA/OpenCL for CPU allows you to take advantage of this.
> 
> Jed

