I believe this is what Martin was referring to:
Look at how arrays of mutexes are defined in Solaris - they are explicitly padded to avoid cache line contention when two CPUs are working on adjacent mutexes simultaneously. In your case, I would expect a lot more contention, as eight CPUs are contending for the same line.
Modify the array to be an array of structures with a union element, the first member of which is the counter and the second a char[] of cache line size - and you should see some improvement, whether you access it from a kernel or a userland program.
-Surya

Yes, I tried this today, but there was no real improvement. As DTrace measurements show, the mutex contention related to the task queue consumes so much time that other problems (such as cache line contention) are negligible. There were about 140 tasks executed per second (compared to 110 tasks before the change), which could be either a slight improvement or just noise... AFAIK, cache line contention only becomes noticeable at (at least) millions of operations per second, whereas the self-contended queue executes no more than hundreds of them per second.

When I ran the same workload (24 million atomic decrements) using a different mechanism I implemented (CPU-bound threads that execute callbacks in batches and need no heavy synchronization with callback producers), the whole process took a tiny fraction of a second, regardless of whether the same cache line was used or not.

Task queues don't seem to be suitable for running a huge number of short
tasks that come in bursts. A big burst of tasks almost stops the whole
task queue. DTrace shows that the situation is very similar to a
livelock. Everyone spins a lot, trying to access the backing queue or
extend a dynamic queue bucket. There is some progress, but as John
Martin already noted, the progress might be related to clock ticks and
other (more or less) random events. (That's why the observed frequency
of task execution is so close to the clock tick frequency.)


I don't know why you're having problems, but for efficiency's sake the amount of work being dispatched to the task queue should be at least (handwaving wildly) 10-100x more time-consuming than the operation of dispatching itself.

Well, this is exactly why I am having problems. The dispatched tasks are so 
short that dispatching can easily take (shooting in the dark) 20 times longer 
than the tasks themselves. Consequently, the task queue threads need to hold 
their bucket mutexes locked for about 95% of their effective running time (20 
parts dispatching out of 21 total). This is not (yet) disastrous, since there 
are per-CPU buckets. However, once all buckets get full, most threads (and 
especially the task producers) start to compete for the global task queue 
mutex, which is what I observe. They spin all the time, and it is just a 
matter of chance that those 100 tasks per second eventually get dispatched.

What I originally tested was a workload with big bursts (== millions) of small 
tasks that take just a few instructions in most cases, but some of them 
(perhaps one in a thousand) *might* sleep. Batching tasks works fine (as 
already mentioned), but then none of the tasks can sleep. So I tried the task 
queues, but they are obviously not designed for this type of workload. As you 
say, a task would have to take much more time than the dispatching overhead to 
keep mutex contention acceptable.

What does

# lockstat -Ikw sleep 10

report when you run it at the same time as your benchmark?

Two outputs from lockstat are attached.

Andrej
Profiling interrupt: 7808 events in 10.066 seconds (776 events/sec)

Count indv cuml rcnt     nsec CPU+PIL                Hottest Caller          
-------------------------------------------------------------------------------
  976  12%  12% 0.00    12976 cpu[7]                 mutex_delay_default     
  976  12%  25% 0.00    13009 cpu[5]                 mutex_delay_default     
  976  12%  38% 0.00    13449 cpu[4]                 mutex_delay_default     
  976  12%  50% 0.00    12795 cpu[3]                 mutex_delay_default     
  974  12%  62% 0.00    13385 cpu[6]                 mutex_delay_default     
  974  12%  75% 0.00    13346 cpu[1]                 mutex_delay_default     
  973  12%  87% 0.00    14017 cpu[2]                 mutex_delay_default     
  972  12% 100% 0.00     8588 cpu[0]                 mutex_delay_default     
    4   0% 100% 0.00     4722 cpu[0]+11              setbackdq               
    3   0% 100% 0.00    16430 cpu[2]+11              cpu_update_pct          
    2   0% 100% 0.00    21032 cpu[6]+11              exp_x                   
    2   0% 100% 0.00    13199 cpu[1]+11              setbackdq               
-------------------------------------------------------------------------------
Profiling interrupt: 7824 events in 10.060 seconds (778 events/sec)

Count indv cuml rcnt     nsec CPU+PIL                Hottest Caller          
-------------------------------------------------------------------------------
  978  12%  12% 0.00    10432 cpu[0]                 mutex_delay_default     
  978  12%  25% 0.00    13171 cpu[4]                 mutex_delay_default     
  977  12%  37% 0.00    12878 cpu[1]                 mutex_delay_default     
  976  12%  50% 0.00     8457 cpu[2]                 mutex_delay_default     
  975  12%  62% 0.00    12393 cpu[5]                 mutex_delay_default     
  974  12%  75% 0.00    12735 cpu[3]                 mutex_delay_default     
  973  12%  87% 0.00    12817 cpu[6]                 mutex_delay_default     
  971  12% 100% 0.00    12091 cpu[7]                 mutex_delay_default     
    6   0% 100% 0.00     9106 cpu[7]+11              dispatch_hilevel        
    3   0% 100% 0.00    10328 cpu[6]+11              sleepq_wakeone_chan     
    3   0% 100% 0.00    13088 cpu[3]+11              sleepq_insert           
    2   0% 100% 0.00     2307 cpu[6]+10              av_dispatch_softvect    
    2   0% 100% 0.00    11009 cpu[5]+11              savectx                 
    2   0% 100% 0.00     1809 cpu[2]+2               cyclic_coverage_hash    
    1   0% 100% 0.00     1945 cpu[7]+10              av_dispatch_softvect    
    1   0% 100% 0.00     3013 cpu[5]+2               dispatch_softint        
    1   0% 100% 0.00    13929 cpu[3]+5               do_splx                 
    1   0% 100% 0.00     3424 cpu[1]+11              turnstile_wakeup        
-------------------------------------------------------------------------------


_______________________________________________
on-discuss mailing list
on-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/on-discuss
