------- Comment #4 from scovich at gmail dot com  2009-05-12 16:36 -------
I'm actually running sparcv9-sun-solaris2.10 (the examples used x86 because
more people know it and its asm is easier to read).

My use case is the following: I'm implementing high-performance synchronization
primitives, and the compiler isn't generating good enough code -- partly because
it doesn't software-pipeline spinloops, and partly because it has no way to know
which operations are truly on the critical path and which just need to happen
"eventually."

Here's a basic idea of what I've been looking at:

long mcs_lock_acquire(mcs_lock* lock, mcs_qnode* me) {
 again:
    /* initialize qnode, etc */
    membar_producer();  /* store-store barrier: qnode init completes before the swap publishes it */
    mcs_qnode* pred = atomic_swap(&lock->tail, me);
    if (pred) {
        /* queue was non-empty: link in behind pred, then spin until released */
        pred->next = me;
        int flags;
        while ((flags = me->wait_flags)) {
            if (flags & ERROR) {
                /* recovery code */
                goto again;
            }
        }
    }
    membar_enter();  /* acquire barrier before entering the critical section */
    return (long) pred;
}

This code is absolutely performance-critical because every instruction on the
critical path delays O(N) other threads -- even a single extra load or store
causes noticeable delays. I was trying to rewrite just the while loop above in
asm to be more efficient, but the goto inside makes that hard. The key
observation is that the error is sticky -- once it shows up it isn't going
anywhere -- so it doesn't need to be checked nearly as often as the flags==0
case, and the check can be interleaved across as many loop iterations as needed
to make its overhead disappear. Manually unrolling and pipelining the loop
helped a bit, but the compiler still tended to cluster operations together more
than was strictly necessary, leading to bursts of saturated pipeline
alternating with slack.
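
For concreteness, here is roughly the shape of the unrolled spin I mean -- a
sketch only, not my actual code. It reuses the wait_flags/ERROR names from the
function above, assumes wait_flags is volatile-qualified so every poll is a
real load, picks four polls per group arbitrarily, and would replace the while
loop inside mcs_lock_acquire. The cheap flags==0 exit is tested on every poll,
while the sticky ERROR bit is branched on only once per group:

    /* Sketch: drop-in for the while loop in mcs_lock_acquire above.
       Assumes wait_flags is volatile so each read really hits memory. */
    int f0, f1, f2, f3;
    for (;;) {
        /* hot path: poll wait_flags, leave as soon as it drops to zero */
        if (!(f0 = me->wait_flags)) break;
        if (!(f1 = me->wait_flags)) break;
        if (!(f2 = me->wait_flags)) break;
        if (!(f3 = me->wait_flags)) break;
        /* cold path: the error bit is sticky, so OR-ing four samples and
           branching once per group still catches it, just a few polls late */
        if ((f0 | f1 | f2 | f3) & ERROR)
            goto again;  /* recovery, exactly as in the original loop */
    }

The point is that a scheduler is then free to spread the four loads apart and
overlap the accumulate-and-branch with the next group's polls, which is the
interleaving the compiler won't do on its own here.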

For CC stuff, especially x86-related, I bet places like FFTW and GMP are good
sources of frustration to mine.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40124
