------- Comment #4 from scovich at gmail dot com 2009-05-12 16:36 ------- I'm actually running sparcv9-sun-solaris2.10 (the examples used x86 because more people know it and its asm is easier to read).
My use case is the following: I'm implementing high-performance synchronization primitives, and the compiler isn't generating good enough code -- partly because it doesn't software-pipeline spinloops, and partly because it has no way to know which operations are truly on the critical path and which just need to happen "eventually." Here's the basic shape of what I've been looking at:

    long mcs_lock_acquire(mcs_lock* lock, mcs_qnode* me) {
    again:
        /* initialize qnode, etc. */
        membar_producer();
        mcs_qnode* pred = atomic_swap(&lock->tail, me);
        if (pred) {
            pred->next = me;
            while (int flags = me->wait_flags) {
                if (flags & ERROR) {
                    /* recovery code */
                    goto again;
                }
            }
        }
        membar_enter();
        return (long) pred;
    }

This code is absolutely performance-critical, because every instruction on the critical path delays O(N) other threads -- even a single extra load or store causes noticeable delays. I was trying to rewrite just the while loop above in asm to be more efficient, but that is hard because of the goto inside it. Basically, once the error shows up it isn't going anywhere, so we don't have to check for it nearly as often as for the flags==0 case, and the check can be interleaved across as many loop iterations as needed to make its overhead disappear. Manually unrolling and pipelining the loop helped a bit, but the compiler still tended to cluster things together more tightly than was strictly necessary (leading to bursts of saturated pipeline alternating with slack).

For CC stuff, especially x86-related, I bet places like fftw and gmp are good sources of frustration to mine.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40124
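To make the "check the error rarely" idea concrete, here is a minimal portable sketch of just the wait loop, using C11 atomics instead of the Solaris membar_*/atomic_swap primitives in the report. The qnode type, the ERROR bit, and the unroll factor of 4 are all assumptions for illustration; the point is that the hot path does only a load and a zero test per iteration, while the ERROR test is hoisted out and amortized over every fourth probe:

```c
#include <stdatomic.h>

#define ERROR 0x1   /* hypothetical error bit, as in the report's wait_flags */

/* Hypothetical stand-in for the report's mcs_qnode. */
typedef struct {
    _Atomic int wait_flags;
} qnode;

/* Spin until wait_flags clears (lock granted) or the ERROR bit appears.
 * Returns 0 on grant, -1 on error (the caller would then retry from
 * the "again:" label in the original function). */
static int spin_wait(qnode *me)
{
    for (;;) {
        int flags = 0;
        /* Hot path: 4 cheap probes, each just a load and a zero test. */
        for (int i = 0; i < 4; i++) {
            flags = atomic_load_explicit(&me->wait_flags,
                                         memory_order_acquire);
            if (flags == 0)
                return 0;           /* lock granted */
        }
        /* Cold path: error check amortized over the 4 probes above,
         * since an error, once set, never goes away. */
        if (flags & ERROR)
            return -1;
    }
}
```

This is single-threaded-testable: preloading wait_flags with 0 makes spin_wait return 0 immediately, and preloading it with ERROR makes it fall through to the cold path and return -1. A real version would also need the SPARC-appropriate backoff and membars from the original.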