Hi -
So I noticed that context switches are slower than they should be on
Akaros on my devbox. As in 10x slower than about two years ago on
older hardware.
These should be around 200 ns (various slow stuff is enabled, that's
nowhere near our best), but not 2k. Let's run it under perf stat.
bash-4.3$ perf stat pthread_test 1 1000000 1
Making 1 threads of 1000000 loops each, on 1 vcore(s), 0 work
Done: 1 uthreads, 1000000 loops, 1 vcores, 0 work
Nr context switches: 1000000
Time to run: 2025324 usec
Context switch latency: 2025 nsec
Context switches / sec: 493748
Performance counter stats for 'pthread_test 1 1000000 1':
2009707 cache-misses # 93.86% of all refs
2141285 cache-references # 1.043 M/sec
2009113 branch-misses # 1.03% of all branches
194688210 branches # 94.873 M/sec
1143448236 instructions # 0.174 insns per cycle
6563374910 cycles # 3.198 G/sec
2.052097000 seconds time elapsed
Whoa, look at those cache misses! It looks like there are two cache
misses per loop! (this is confirmed with changing the number of loops).
I'll spare you the long story. I suspected the "LOCK add" in
pop_user_ctx() that I mentioned before (it popped up in perf record
during vmexit speed tests). Basically, we need a memory barrier there.
Two years ago, doing this:
"lock addq $0, (%%rdi);" /* LOCK is a CPU mb() */
was faster than an mfence, and the memory that %rdi points to was just
written to in the previous instruction.
Supposedly that was fast, and it was on my old server box at Berkeley
(c89) and also on my old i7 desktop.
I replaced the lock addq with an mfence:
$ perf stat pthread_test 1 1000000 1
Making 1 threads of 1000000 loops each, on 1 vcore(s), 0 work
Done: 1 uthreads, 1000000 loops, 1 vcores, 0 work
Nr context switches: 1000000
Time to run: 230671 usec
Context switch latency: 230 nsec
Context switches / sec: 4335178
Performance counter stats for 'pthread_test 1 1000000 1':
13681 cache-misses # 9.49% of all refs
144189 cache-references # 545.241 K/sec
2008320 branch-misses # 1.03% of all branches
194618571 branches # 735.937 M/sec
1143070866 instructions # 1.397 insns per cycle
818485699 cycles # 3.095 G/sec
0.264450000 seconds time elapsed
So wtf is that? I'd buy it if the LOCK were just slower than MFENCE,
but it also triggers two cache misses? For a line that should already
have been modified in the cache? This is all single-core, btw.
Why would the lock addq trigger excessive cache misses? One line above
that, we did this:
movb $0x00, (%%rdi);
So the cache line should've been present...
But wait a second. Let's look at the full code again:
"movb $0x00, (%2); " /* enable notifications */
/* Need a wrmb() here so the write of enable_notif can't pass
* the read of notif_pending (racing with a potential
* cross-core call with proc_notify()). */
"lock addq $0, (%2); " /* LOCK is a CPU mb() */
/* From here down, we can get interrupted and restarted */
"testb $0x01, (%3); " /* test if a notif is pending */
The mov was a movb, just a byte. The lock add was an addq, 8 bytes.
Perhaps we're crossing cache lines with that operation! Here's the
relevant parts of struct preempt_data:
_Bool notif_disabled; /* 3196 1 */
_Bool notif_pending; /* 3197 1 */
/* XXX 2 bytes hole, try to pack */
/* --- cacheline 50 boundary (3200 bytes) --- */
where %rdi points to notif_disabled. An 8-byte access from there covers
[3196, 3204), which crosses a cache line. Doh! A LOCK'd operation that
straddles two cache lines can't be handled with cache locking, so the
CPU falls back to a bus lock (a "split lock"), which is painfully slow
and shows up here as extra cache misses. Just to double
check (and that it's not on a page boundary at runtime), let's see what
we're dealing with in pthread_test (debugging printf I added):
I'm on vcore 0, my notif_disabled addr is 0x7f7fffa01e7c
OK, at least it's not on a page boundary, but it is 4 bytes below a
cache line.
Let's change the "lock addq" to a "lock addb":
$ perf stat pthread_test 1 1000000 1
Making 1 threads of 1000000 loops each, on 1 vcore(s), 0 work
Done: 1 uthreads, 1000000 loops, 1 vcores, 0 work
Nr context switches: 1000000
Time to run: 216000 usec
Context switch latency: 216 nsec
Context switches / sec: 4629629
Performance counter stats for 'pthread_test 1 1000000 1':
15497 cache-misses # 10.68% of all refs
145109 cache-references # 597.156 K/sec
2008530 branch-misses # 1.03% of all branches
194618536 branches # 800.899 M/sec
1143072783 instructions # 1.478 insns per cycle
773454209 cycles # 3.183 G/sec
0.243000000 seconds time elapsed
That's more like it. Hardly any cache misses, and a fast context
switch.
My guess is this has been broken since I added the VM contexts a while
back, or maybe the floating point state changes, since those changed the
size of struct preempt_data and moved notif_disabled closer to a
cacheline boundary.
Also, note that the LOCK does better than MFENCE, as I had previously
seen. In fact, if I remove the LOCK entirely, I don't notice a
performance hit at all, though its cost is probably still there, just
lost in the noise.
Here's the final tally:
No operation at all: 216 ns
lock addb: 216 ns
mfence: 230 ns
lock addq: 2025 ns
Also note that this is with the pthread scheduler doing dumb things.
Using a user-level switch-to (neither modified nor tuned since 2014),
we still do quite well (with the lock addb):
$ pthread_switch 1000000
Making 2 threads of 1000000 switches each
Done: 1000000 loops
Nr context switches: 2000000
Time to run: 96000 usec
Context switch latency: 48 nsec
Context switches / sec: 20833333
(on this one, the lock addb cost us a ns, from 47 to 48).
Anyway, mystery solved!
Barret
--
You received this message because you are subscribed to the Google Groups
"Akaros" group.