I think this could be a memory-related issue, but I'm not sure. I reduced the number of callbacks by a factor of 1000 and reran the benchmark. Based on an estimate of 100 callbacks per second, it should have taken about 80 seconds. But this time the pathological case did not occur: the benchmark completed in a fraction of a second.

There is still something wrong with this theory, though: when I looked at the memory statistics during the benchmark with the original high number of callbacks, there was *no* evidence of memory pressure.

> Is the pointer given to each callback:
>
>     static void callback(uint32_t *counter) {
>         atomic_dec_32(counter);
>     }
>
> on its own cache line, or do you have everyone hammering the same cache line?

There are currently 8 32-bit counters in an array, so they all fit into one cache line.

I believe this is what Martin was referring to: look at how arrays of mutexes are defined in Solaris - they are explicitly padded to avoid cache line contention when two CPUs work on adjacent mutexes simultaneously. In your case I would expect a lot more contention, since 8 CPUs are contending for the same line. Modify the array to be an array of structures with a union element, the first member of which is the counter and the second a char[] of cache line size - and you should see some improvement whether you access it from the kernel or a userland program.

-Surya
Yes, I tried this today, but there was no real improvement. As DTrace measurements show, the mutex contention related to the task queue consumes so much time that other problems (such as cache lines) are negligible. There were about 140 tasks executed per second (compared to 110 tasks before the change), which could be either a slight improvement or just noise. AFAIK, cache line contention becomes noticeable at (at least) millions of operations per second, whereas the self-contended queue executes no more than hundreds per second. When I ran the same workload (24 million atomic decrements) using a different mechanism I implemented (CPU-bound threads that execute callbacks in batches and need no heavy synchronization with callback producers), the whole process took a tiny fraction of a second, whether or not the same cache line was used.

Task queues don't seem to be suitable for running a huge number of short tasks that arrive in bursts. A big burst of tasks almost stops the whole task queue. DTrace shows that the situation is very similar to a livelock: everyone spins a lot, trying to access the backing queue or to extend a dynamic queue bucket. There is some progress, but as John Martin already noted, the progress might be related to clock ticks and other (more or less) random events. (That's why the observed frequency of task execution is so close to the clock tick frequency.)

If the task queue required its users to pass in a pointer to a data structure with room for list pointers, it could operate in a parallel and non-blocking fashion, with no backing queue at all. The current implementation takes a different approach, which is convenient in that you need not allocate any data before calling taskq_dispatch(). However, the contention I observe is the obvious downside of this API. If something like this can happen on 8 CPUs, it's hard to imagine what could happen on a SPARC T2.

Andrej
_______________________________________________
on-discuss mailing list
on-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/on-discuss