I believe this is what Martin was referring to: look at how arrays of mutexes are defined in Solaris - they are explicitly padded to avoid cache line contention when two CPUs are working on adjacent mutexes simultaneously. In your case, I would expect a lot more contention, as eight CPUs are contending for the same line. Modify the array to be an array of structures with a union element, the first member of which is the counter and the second a char[] of cache line size - and you should see some improvement whether you access it from the kernel or a userland program.

-Surya

Yes, I tried this today, but there was no real improvement. As DTrace measurements show, the mutex contention related to the task queue consumes so much time that other problems (such as cache line effects) are negligible. There were about 140 tasks executed per second (compared to 110 tasks before the change), which could be either a slight improvement or just noise... AFAIK, cache line contention becomes noticeable only at (at least) millions of operations per second, whereas the self-contended queue executes no more than hundreds of them per second. When I ran the same workload (24 million atomic decrements) using a different mechanism I implemented (CPU-bound threads that execute callbacks in batches and need no heavy synchronization with callback producers), the whole process took a tiny fraction of a second, no matter whether the same cache line was used or not.

Task queues don't seem to be suitable for running a huge number of short tasks that come in bursts. A big burst of tasks almost stops the whole task queue. DTrace shows that the situation is very similar to a livelock: everyone spins a lot, trying to access the backing queue or extend a dynamic queue bucket. There is some progress, but as John Martin already noted, the progress might be related to clock ticks and other (more or less) random events.
(That's why the observed frequency of task execution is so close to the clock tick frequency.)

I don't know why you're having problems, but for efficiency's sake the amount of work being dispatched to the task queue should be at least (handwaving wildly) 10-100x more time-consuming than the operation of dispatching itself.
Well, this is exactly why I am having problems. The dispatched tasks are so short that dispatching can easily take (shooting in the dark) 20 times longer than the tasks themselves. Consequently, the task queue threads need to hold their bucket mutexes locked for 95% of their effective running time. This is not (yet) disastrous, since there are per-CPU buckets. However, once all buckets get full, most threads (and especially the task producers) start to compete for the global task queue mutex, which is what I observe. They spin all the time, and it is just a matter of chance that those 100 tasks per second eventually get dispatched.

What I originally tested was a workload with big bursts (== millions) of small tasks that take just a few instructions in most cases, but some of which (perhaps one in a thousand) *might* sleep. Batching tasks works fine (as already mentioned), but then none of the tasks can sleep. So I tried the task queues, but they are obviously not designed for this type of workload. As you say, a task would have to take much more time than the dispatching overhead to keep mutex contention acceptable.
What does # lockstat -Ikw sleep 10 report when you run it at the same time as your benchmark?
Two outputs from lockstat are attached.

Andrej
Profiling interrupt: 7808 events in 10.066 seconds (776 events/sec)

Count indv cuml rcnt     nsec CPU+PIL        Hottest Caller
-------------------------------------------------------------------------------
  976  12%  12% 0.00    12976 cpu[7]         mutex_delay_default
  976  12%  25% 0.00    13009 cpu[5]         mutex_delay_default
  976  12%  38% 0.00    13449 cpu[4]         mutex_delay_default
  976  12%  50% 0.00    12795 cpu[3]         mutex_delay_default
  974  12%  62% 0.00    13385 cpu[6]         mutex_delay_default
  974  12%  75% 0.00    13346 cpu[1]         mutex_delay_default
  973  12%  87% 0.00    14017 cpu[2]         mutex_delay_default
  972  12% 100% 0.00     8588 cpu[0]         mutex_delay_default
    4   0% 100% 0.00     4722 cpu[0]+11      setbackdq
    3   0% 100% 0.00    16430 cpu[2]+11      cpu_update_pct
    2   0% 100% 0.00    21032 cpu[6]+11      exp_x
    2   0% 100% 0.00    13199 cpu[1]+11      setbackdq
-------------------------------------------------------------------------------
Profiling interrupt: 7824 events in 10.060 seconds (778 events/sec)

Count indv cuml rcnt     nsec CPU+PIL        Hottest Caller
-------------------------------------------------------------------------------
  978  12%  12% 0.00    10432 cpu[0]         mutex_delay_default
  978  12%  25% 0.00    13171 cpu[4]         mutex_delay_default
  977  12%  37% 0.00    12878 cpu[1]         mutex_delay_default
  976  12%  50% 0.00     8457 cpu[2]         mutex_delay_default
  975  12%  62% 0.00    12393 cpu[5]         mutex_delay_default
  974  12%  75% 0.00    12735 cpu[3]         mutex_delay_default
  973  12%  87% 0.00    12817 cpu[6]         mutex_delay_default
  971  12% 100% 0.00    12091 cpu[7]         mutex_delay_default
    6   0% 100% 0.00     9106 cpu[7]+11      dispatch_hilevel
    3   0% 100% 0.00    10328 cpu[6]+11      sleepq_wakeone_chan
    3   0% 100% 0.00    13088 cpu[3]+11      sleepq_insert
    2   0% 100% 0.00     2307 cpu[6]+10      av_dispatch_softvect
    2   0% 100% 0.00    11009 cpu[5]+11      savectx
    2   0% 100% 0.00     1809 cpu[2]+2       cyclic_coverage_hash
    1   0% 100% 0.00     1945 cpu[7]+10      av_dispatch_softvect
    1   0% 100% 0.00     3013 cpu[5]+2       dispatch_softint
    1   0% 100% 0.00    13929 cpu[3]+5       do_splx
    1   0% 100% 0.00     3424 cpu[1]+11      turnstile_wakeup
-------------------------------------------------------------------------------
_______________________________________________ on-discuss mailing list on-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/on-discuss