Hi -
So I noticed that context switches are slower than they should be on
Akaros on my devbox. As in 10x slower than about two years ago on
older hardware.
These should be around 200 ns (various slow stuff is enabled, that's
nowhere near our best), but not 2k. Let's run it under perf stat.
bash-4.3$ perf stat pthread_test 1 1000000 1
Making 1 threads of 1000000 loops each, on 1 vcore(s), 0 work
Done: 1 uthreads, 1000000 loops, 1 vcores, 0 work
Nr context switches: 1000000
Time to run: 2025324 usec
Context switch latency: 2025 nsec
Context switches / sec: 493748
Performance counter stats for 'pthread_test 1 1000000 1':
2009707 cache-misses # 93.86% of all refs
2141285 cache-references # 1.043 M/sec
2009113 branch-misses # 1.03% of all branches
194688210 branches # 94.873 M/sec
1143448236 instructions # 0.174 insns per cycle
6563374910 cycles # 3.198 G/sec
2.052097000 seconds time elapsed
Whoa, look at those cache misses! It looks like there are two cache
misses per loop! (this is confirmed with changing the number of loops).
I'll spare you the long story. I suspected the "LOCK add" in
pop_user_ctx() that I mentioned before (it popped up in perf record
during vmexit speed tests). Basically, we need a memory barrier there.
Two years ago, doing this:
"lock addq $0, (%%rdi);" /* LOCK is a CPU mb() */
was faster than an mfence, and the memory that %rdi points to was just
written to in the previous instruction.
Supposedly that was fast, and it was on my old server box at Berkeley
(c89) and also on my old i7 desktop.
I replaced the lock addq with an mfence:
$ perf stat pthread_test 1 1000000 1
Making 1 threads of 1000000 loops each, on 1 vcore(s), 0 work
Done: 1 uthreads, 1000000 loops, 1 vcores, 0 work
Nr context switches: 1000000
Time to run: 230671 usec
Context switch latency: 230 nsec
Context switches / sec: 4335178
Performance counter stats for 'pthread_test 1 1000000 1':
13681 cache-misses # 9.49% of all refs
144189 cache-references # 545.241 K/sec
2008320 branch-misses # 1.03% of all branches
194618571 branches # 735.937 M/sec
1143070866 instructions # 1.397 insns per cycle
818485699 cycles # 3.095 G/sec
0.264450000 seconds time elapsed
So wtf is that? I'd buy it if the LOCK were just slower than MFENCE,
but it also triggers two cache misses? For a line that should already
have been modified in the cache? This is all single-core, btw.
Why would the lock addq trigger excessive cache misses? One line above
that, we did this:
movb $0x00, (%%rdi);
So the cache line should've been present...
But wait a second. Let's look at the full code again:
"movb $0x00, (%2); " /* enable notifications */
/* Need a wrmb() here so the write of enable_notif can't pass
* the read of notif_pending (racing with a potential
* cross-core call with proc_notify()). */
"lock addq $0, (%2); " /* LOCK is a CPU mb() */
/* From here down, we can get interrupted and restarted */
"testb $0x01, (%3); " /* test if a notif is pending */
The mov was a movb, just a byte. The lock add was an addq, 8 bytes.
Perhaps we're crossing cache lines with that operation! Here's the
relevant parts of struct preempt_data:
_Bool notif_disabled; /* 3196 1 */
_Bool notif_pending; /* 3197 1 */
/* XXX 2 bytes hole, try to pack */
/* --- cacheline 50 boundary (3200 bytes) --- */
where %rdi points to notif_disabled. An 8-byte access from there covers
[3196, 3204), which crosses a cache line. Doh! A LOCK'd operation that
straddles two cache lines can't be handled with cache locking, so the
CPU falls back to a bus lock (a "split lock"), which is painfully slow
and shows up here as extra cache misses. Just to double
check (and that it's not on a page boundary at runtime), let's see what
we're dealing with in pthread_test (debugging printf I added):
I'm on vcore 0, my notif_disabled addr is 0x7f7fffa01e7c
OK, at least it's not on a page boundary, but it is 4 bytes below a
cache line.
Let's change the "lock addq" to a "lock addb":
$ perf stat pthread_test 1 1000000 1
Making 1 threads of 1000000 loops each, on 1 vcore(s), 0 work
Done: 1 uthreads, 1000000 loops, 1 vcores, 0 work
Nr context switches: 1000000
Time to run: 216000 usec
Context switch latency: 216 nsec
Context switches / sec: 4629629
Performance counter stats for 'pthread_test 1 1000000 1':
15497 cache-misses # 10.68% of all refs
145109 cache-references # 597.156 K/sec
2008530 branch-misses # 1.03% of all branches
194618536 branches # 800.899 M/sec
1143072783 instructions # 1.478 insns per cycle
773454209 cycles # 3.183 G/sec
0.243000000 seconds time elapsed
That's more like it. Hardly any cache misses, and a fast context
switch.
My guess is this has been broken since I added the VM contexts a while
back, or maybe the floating point state changes, since those changed the
size of struct preempt_data and moved notif_disabled closer to a
cacheline boundary.
Also, note that the LOCK does better than MFENCE, as I had previously
seen. In fact, if I remove the LOCK entirely, I don't notice a
performance hit at all, though its cost is probably still there, just
lost in the noise.
Here's the final tally:
No operation at all: 216 ns
lock addb: 216 ns
mfence: 230 ns
lock addq: 2025 ns
Also note that this is with the pthread scheduler doing dumb things.
Using a user-level switch-to (neither modified nor tuned since 2014),
we still do quite well (with the lock addb):
$ pthread_switch 1000000
Making 2 threads of 1000000 switches each
Done: 1000000 loops
Nr context switches: 2000000
Time to run: 96000 usec
Context switch latency: 48 nsec
Context switches / sec: 20833333
(on this one, the lock addb cost us a ns, from 47 to 48).
Anyway, mystery solved!
Barret
--
You received this message because you are subscribed to the Google Groups
"Akaros" group.