On 9/18/20 1:25 PM, Andrew Stubbs wrote:
> This patch fixes a problem in which nested OpenMP parallel regions cause
> errors if the number of inner teams is not balanced (i.e. the number of
> loop iterations is not divisible by the number of physical threads). A
> testcase is included.
>
> On NVPTX the symptom was a fatal error:
>
>   libgomp: cuCtxSynchronize error: an illegal instruction was encountered
>
> This was caused by mismatched "bar.sync" instructions (one waiting for
> 32 threads while another was waiting for 256). The source of the mismatch
> was that some threads were still busy while others had run out of work
> to do.
>
> On GCN there was no such error (GCN barriers always wait for all
> threads), but it worked only by chance: the idle threads were "matching"
> different barriers than the busy threads, which was harmless because the
> thread function pointer remained NULL.
>
> This patch simply skips barriers when they would "wait" for only one
> thread (the current thread). This means that teams nested inside other
> teams now run independently, instead of strictly in lock-step, and is
> only valid as long as inner teams are limited to one thread each
> (currently the case).
Is this inner-team one-thread limit coded or documented somewhere? If so, it
might be good to add a comment there referring to the code this patch adds.
A follow-up patch is OK, thanks.

- Tom

> When the inner regions exit, the barriers for
> the outer region will sync everything up again.
>
> OK to commit?
>
> Andrew
>
> P.S. I can approve the amdgcn portion myself; I'm seeking approval for
> the nvptx portion.