I think this could be a memory-related issue, but I'm not sure about it. I reduced the number of callbacks by a factor of 1000 and reran the benchmark. Based on an estimate of 100 callbacks per second, it should have taken about 80 seconds. But this time the pathological case did not happen: the benchmark completed in a fraction of a second.

There is still something wrong with this theory, though: when I looked at the memory statistics during the benchmark with the original high number of callbacks, there was *no* evidence of memory pressure.

Is the pointer given to each callback:

    static void callback(uint32_t *counter) { atomic_dec_32(counter); }

on its own cache line, or do you have everyone hammering the same cache line?
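To illustrate what I mean (this is only a sketch, not your code: it assumes a 64-byte cache line and a gcc-style alignment attribute, and the names are made up), each counter would get its own line like this:

    #include <stdint.h>

    #define CACHE_LINE_SIZE 64

    /*
     * Pad each 4-byte counter out to a full (assumed) 64-byte cache line,
     * so that atomic_dec_32() calls on different counters touch different
     * lines and the CPUs stop invalidating each other's cache.
     */
    typedef union {
            uint32_t        value;
            char            pad[CACHE_LINE_SIZE];
    } padded_counter_t;

    static padded_counter_t counters[8]
            __attribute__((aligned(CACHE_LINE_SIZE)));

    /* each callback would then be handed &counters[i].value */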
There are currently 8 32-bit counters in an array, so they all fit into one cache line. I know this could be much less efficient than a single-threaded algorithm, but I still think 100 callbacks per second is at least two orders of magnitude worse than a worst-case estimate. I will definitely try to add 60 bytes of padding between the counters to see what happens.

These are some diagnostics based on the advice from people on this mailing list:
mpstat output (averages):

    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
      0    0   0   34   589  101 1151   59  346 5928    0   279    0  93   0   7
      1    0   0  119   320   13 2419   39  533 5021    0     0    0  85   0  15
      2    0   0  577   215   10 3082   30  479 3830    0     0    0  86   0  14
      3    0   0  618   297   78 3635   50  703 4458    0    22    0  90   0  10
      4    0   0   30   316    5 1189   56  572 4879    0     0    0  91   0   9
      5    0   0   50   319    8 2864   42  602 4637    0     0    0  88   0  12
      6    0   0   45   359   10 4754   57  550 5226    0     0    0  88   0  12
      7    0   0   22   337   17  773   30  466 6267    0     0    0  87   0  13

Sometimes mpstat showed more than 30000 mutex sleep events per second; the numbers above are averages.

These are the last lines from

    dtrace -n 'profile-200us{@[cpu, arg0] = count();} END{printa("%u:%a %@u\n", @);}'

    0:unix`mutex_delay_default+0xa 31036
    1:unix`i86_mwait+0xd 31638
    7:unix`mutex_delay_default+0x7 33559
    6:unix`mutex_delay_default+0x7 34054
    3:unix`mutex_delay_default+0x7 34732
    1:unix`mutex_delay_default+0x7 35052
    2:unix`mutex_delay_default+0x7 35969
    5:unix`mutex_delay_default+0x7 36644
    4:unix`mutex_delay_default+0x7 37903
    0:unix`mutex_delay_default+0x7 43368

And these are the hottest stacks causing the mutex contention, from

    dtrace -n mutex_vector_enter:entry'{@[stack()] = count()}'

              genunix`taskq_bucket_dispatch+0x5d
              genunix`taskq_dispatch+0x198
              bench`bench_push_callbacks+0xdb
              bench`bench_thread+0x30a
              unix`thread_start+0x8
            11549

              genunix`cv_timedwait_hires+0x10b
              genunix`cv_reltimedwait+0x6c
              genunix`taskq_thread_wait+0x84
              genunix`taskq_d_thread+0x1b7
              unix`thread_start+0x8
            19231

Interestingly, most threads from dynamic task queues are *not* in the SDC class, even though the taskq was instantiated with taskq_create_sysdc(). Furthermore, the priority of those threads is implicitly set to minclsyspri, which is too low to compete with the benchmarking threads in the SDC class (which run at maxclsyspri most of the time). In the current implementation only LWPs can enter the SDC class, but creating entire LWPs dynamically in the taskq would probably cause latency issues, and that is why bare threads are used.

So I tried taskq_create_proc() instead of taskq_create_sysdc(), since the former supports setting the priority explicitly. But there was still a lot of contention, even when running the queue at maxclsyspri... Once too many callbacks accumulate and the buckets get contended, all hope is gone. ;-)

Well, I have just learned that task queues are not suitable for what I was trying to do. CPU-bound threads that batch the callbacks together perform three orders of magnitude better. (And that's fine, as long as the callbacks need not sleep.)

Many thanks to everyone for your help and advice.

Andrej
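P.S. In case it helps anyone reading the archives: below is a rough user-level sketch of the batching approach. It is only an illustration (the real benchmark is a kernel module; the thread count, callback count and names here are made up). Each CPU-bound thread gets its whole share of callbacks up front and runs them in a tight loop, so there is no per-callback taskq dispatch, no bucket locks and no wakeups. Each counter is also padded to its own (assumed 64-byte) cache line, as discussed earlier in the thread.

    #include <atomic.h>     /* atomic_dec_32() */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NTHREADS        8
    #define NCALLBACKS      8000000UL   /* total work, split across threads */

    /* one counter per thread, each padded to its own (assumed 64-byte) line */
    typedef union {
            uint32_t        value;
            char            pad[64];
    } padded_counter_t;

    static padded_counter_t counters[NTHREADS];

    static void
    callback(uint32_t *counter)
    {
            atomic_dec_32(counter);
    }

    /* run the whole batch in a tight loop -- no per-callback dispatch */
    static void *
    worker(void *arg)
    {
            uint32_t *counter = arg;
            unsigned long i;

            for (i = 0; i < NCALLBACKS / NTHREADS; i++)
                    callback(counter);
            return (NULL);
    }

    int
    main(void)
    {
            pthread_t tid[NTHREADS];
            int i;

            for (i = 0; i < NTHREADS; i++) {
                    counters[i].value = NCALLBACKS / NTHREADS;
                    (void) pthread_create(&tid[i], NULL, worker,
                        &counters[i].value);
            }
            for (i = 0; i < NTHREADS; i++)
                    (void) pthread_join(tid[i], NULL);

            for (i = 0; i < NTHREADS; i++)
                    printf("counter %d: %u\n", i, counters[i].value);
            return (0);
    }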