I think this could be a memory-related issue, but I'm not sure about it.
I reduced the number of callbacks by a factor of 1000 and reran the benchmark.
Based on an estimate of 100 callbacks per second, it should have taken
about 80 seconds. But this time the pathological case did not happen:
the benchmark completed in a fraction of a second.

But something still contradicts this theory: when I looked at the memory
statistics during the benchmark with the original, high number of
callbacks, there was *no* evidence of memory pressure.

Is the pointer given to each callback:

static void
callback(uint32_t *counter)
{
	atomic_dec_32(counter);
}

on its own cache line, or do you have everyone hammering the same cache
line?

There are currently 8 32-bit counters in an array, so they can all fit
into one cache line.
I believe this is what Martin was referring to:

Look at how arrays of mutexes are defined in Solaris: they are explicitly
padded to avoid cache-line contention when two CPUs are working on adjacent
mutexes simultaneously. In your case I would expect a lot more contention,
as eight CPUs are contending for the same line. Modify the array to be an
array of structures with a union element, the first member being the counter
and the second a char[] of cache-line size, and you should see some
improvement whether you access it from the kernel or a userland program.

-Surya

Yes, I tried this today, but there was no real improvement. As DTrace 
measurements show, the mutex contention related to the task queue consumes so 
much time that other problems (such as cache lines) are negligible. There were 
about 140 tasks executed per second (compared to 110 tasks before the change), 
which could be either a slight improvement or just noise... AFAIK, cache-line 
contention becomes noticeable only at (at least) millions of operations per 
second, whereas the self-contended queue executes no more than hundreds of them 
per second.

When I ran the same workload (24 million atomic decrements) using a different 
mechanism I implemented (CPU-bound threads that execute callbacks in batches 
and need no heavy synchronization with callback producers), the whole process 
took a tiny fraction of a second, no matter if the same cache line was used or 
not.

Task queues don't seem to be suitable for running a huge number of short tasks 
that come in bursts. A big burst of tasks almost stops the whole task queue. 
DTrace shows that the situation is very similar to a livelock. Everyone spins a 
lot, trying to access the backing queue or extend a dynamic queue bucket. There 
is some progress, but as John Martin already noted, the progress might be 
related to clock ticks and other (more or less) random events. (That's why the 
observed frequency of task execution is so close to the clock tick frequency.)

If the task queue required its users to pass in a pointer to a data structure 
with room for list pointers, it could operate in a parallel and non-blocking 
fashion, with no backing queue at all. The current implementation takes a 
different approach, which is convenient in the sense that you need not 
allocate any data before calling taskq_dispatch(). However, the contention I 
observe is the obvious downside of this API. If something like this can happen 
on 8 CPUs, it's hard to imagine what could happen on a SPARC T2.

Andrej
