I think this could be a memory-related issue, but I'm not sure. I reduced
the number of callbacks by a factor of 1000 and reran the benchmark. Based
on an estimate of 100 callbacks per second, it should still have taken
about 80 seconds. But this time the pathological case did not happen:
the benchmark completed in a fraction of a second.

But there's still something wrong with this theory: when I looked at the
memory statistics during the benchmark with the original, high number of
callbacks, there was *no* evidence of memory pressure.

> Is the pointer given to each callback:
>
> static void
> callback(uint32_t *counter) {
>         atomic_dec_32(counter);
> }
>
> on its own cache line or do you have everyone hammering
> the same cache line?

There are currently 8 32-bit counters in an array, so they all fit into one
cache line. I know this could be much less efficient than a single-threaded
algorithm, but I still think 100 callbacks per second is at least two orders
of magnitude worse than a worst-case estimate. I will definitely try adding
60 bytes of padding between the counters to see what happens.
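
Concretely, this is the kind of padding I have in mind. It is just a sketch
that assumes a 64-byte cache line; the struct and symbol names are made up
for illustration:

#include <sys/types.h>
#include <sys/atomic.h>

#define BENCH_CACHE_LINE   64   /* assumed cache line size */
#define BENCH_NCOUNTERS    8

typedef struct bench_counter {
        uint32_t  bc_value;     /* the actual counter */
        char      bc_pad[BENCH_CACHE_LINE - sizeof (uint32_t)];
} bench_counter_t;

static bench_counter_t bench_counters[BENCH_NCOUNTERS];

static void
callback(uint32_t *counter)
{
        atomic_dec_32(counter); /* each counter now sits on its own line */
}

(For a strict one-counter-per-line guarantee the array would also have to be
cache-line aligned, but the 60 bytes of spacing alone should already keep any
two counters out of the same line.)
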
These are some diagnostics based on the advice from people on this mailing list:

        CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
          0    0   0   34   589  101 1151   59  346 5928    0   279    0  93   0   7
          1    0   0  119   320   13 2419   39  533 5021    0     0    0  85   0  15
          2    0   0  577   215   10 3082   30  479 3830    0     0    0  86   0  14
          3    0   0  618   297   78 3635   50  703 4458    0    22    0  90   0  10
          4    0   0   30   316    5 1189   56  572 4879    0     0    0  91   0   9
          5    0   0   50   319    8 2864   42  602 4637    0     0    0  88   0  12
          6    0   0   45   359   10 4754   57  550 5226    0     0    0  88   0  12
          7    0   0   22   337   17  773   30  466 6267    0     0    0  87   0  13

Sometimes mpstat showed more than 30000 mutex sleep events per second, but
the output above shows the average values. These are the last lines of output
from:

        dtrace -n 'profile-200us{@[cpu, arg0] = count();}
        END{printa("%u:%a %@u\n", @);}'

        0:unix`mutex_delay_default+0xa  31036
        1:unix`i86_mwait+0xd    31638
        7:unix`mutex_delay_default+0x7  33559
        6:unix`mutex_delay_default+0x7  34054
        3:unix`mutex_delay_default+0x7  34732
        1:unix`mutex_delay_default+0x7  35052
        2:unix`mutex_delay_default+0x7  35969
        5:unix`mutex_delay_default+0x7  36644
        4:unix`mutex_delay_default+0x7  37903
        0:unix`mutex_delay_default+0x7  43368

And these are the hottest stacks that cause the mutex issues:

        dtrace -n 'mutex_vector_enter:entry{@[stack()] = count();}'

                genunix`taskq_bucket_dispatch+0x5d
                genunix`taskq_dispatch+0x198
                bench`bench_push_callbacks+0xdb
                bench`bench_thread+0x30a
                unix`thread_start+0x8
        11549

                genunix`cv_timedwait_hires+0x10b
                genunix`cv_reltimedwait+0x6c
                genunix`taskq_thread_wait+0x84
                genunix`taskq_d_thread+0x1b7
                unix`thread_start+0x8
        19231


Interestingly, most threads of a dynamic taskq are *not* in the SDC class,
even though the taskq was instantiated with taskq_create_sysdc(). Furthermore,
the priority of those threads is implicitly set to minclsyspri, which is too
low to compete with the benchmarking threads in the SDC class (which run at
maxclsyspri most of the time). In the current implementation only LWPs can
enter the SDC class, but creating full LWPs dynamically in the taskq would
probably introduce latency issues, which is presumably why bare kernel
threads are used instead.

So I tried taskq_create_proc() instead of taskq_create_sysdc(); the former
supports setting the thread priority explicitly. But there was still a lot of
contention, even with the queue running at maxclsyspri... Once too many
callbacks accumulate and the buckets get contended, all hope is gone. ;-)
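
For the record, the second attempt looked roughly like the sketch below. The
queue name and the thread/alloc counts are made up, and p0 stands in for
whatever process the real code hands to the taskq; the point is only that
taskq_create_proc() takes an explicit pri_t while taskq_create_sysdc() does
not:

#include <sys/taskq.h>
#include <sys/proc.h>           /* p0 */
#include <sys/disp.h>           /* maxclsyspri */

static taskq_t *bench_tq;

static void
bench_tq_init(void)
{
        /*
         * Run the (dynamic, bucketed) queue at maxclsyspri instead of
         * letting its threads default to minclsyspri.
         */
        bench_tq = taskq_create_proc("bench_tq", 8, maxclsyspri,
            8, 128, &p0, TASKQ_DYNAMIC);
}

Even at that priority, though, the per-bucket dispatch locks stay hot once the
queue backs up, which is exactly what the stacks above show.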

Well, I have just learned that task queues are not suitable for what I was
trying to do. CPU-bound threads that batch the callbacks together perform
three orders of magnitude better. (And that's fine, as long as the callbacks
need not sleep.)

Many thanks to everyone for your help and advice.

Andrej
