On 07/15/10 00:49, Andrej Podzimek wrote:
I think this could be a memory-related issue, but I'm not sure about it.
I reduced the number of callbacks 1000 times and reran the benchmark.
Based on an estimate of 100 callbacks per second, it should have taken
about 80 seconds. But this time the pathological case did not happen:
the benchmark completed in a fraction of a second.

But there's still something wrong with this theory: when I looked at the
memory statistics during the benchmark with the original high number of
callbacks, there was *no* evidence of memory pressure:

Is the pointer given to each callback:

static void
callback(uint32_t *counter) {
atomic_dec_32(counter);
}

on its own cache line, or do you have everyone hammering
the same cache line?

There are currently 8 32-bit counters in an array, so they can all fit
into one cache line.
I believe this is what Martin was referring to: look at how arrays of
mutexes are defined in Solaris. They are explicitly padded to avoid
cache line contention when two CPUs are working on adjacent mutexes
simultaneously. In your case, I would expect a lot more contention, as
eight CPUs are contending for the same line.
Modify the array to be an array of structures with a union element, the
first member of which is the counter and the second a char[] of cache
line size, and you should see some improvement whether you access it
from a kernel or userland program.
-Surya

Yes, I tried this today, but there was no real improvement. As DTrace
measurements show, the mutex contention related to the task queue
consumes so much time that other problems (such as cache lines) are
negligible. There were about 140 tasks executed per second (compared to
110 tasks before the change), which could be either a slight improvement
or just noise... AFAIK, cache line contention only becomes noticeable at
(at least) millions of operations per second, whereas the self-contended
queue executes no more than hundreds of them per second.

When I ran the same workload (24 million atomic decrements) using a
different mechanism I implemented (CPU-bound threads that execute
callbacks in batches and need no heavy synchronization with callback
producers), the whole process took a tiny fraction of a second, no
matter if the same cache line was used or not.

Task queues don't seem to be suitable for running a huge number of short
tasks that come in bursts. A big burst of tasks almost stops the whole
task queue. DTrace shows that the situation is very similar to a
livelock. Everyone spins a lot, trying to access the backing queue or
extend a dynamic queue bucket. There is some progress, but as John
Martin already noted, the progress might be related to clock ticks and
other (more or less) random events. (That's why the observed frequency
of task execution is so close to the clock tick frequency.)


I don't know why you're having problems, but for efficiency's sake the
amount of work being dispatched to the task queue should be at least
(handwaving wildly) 10-100x more time consuming than the operation of
dispatching itself.

What does

# lockstat -Ikw sleep 10

report when you run it at the same time as your benchmark?

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
bart.smaald...@oracle.com       http://blogs.sun.com/barts
"You will contribute more with mercurial than with thunderbird."
_______________________________________________
on-discuss mailing list
on-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/on-discuss