Hello,

I ran into some problems with dynamic task queue performance, so I ran a benchmark inside the kernel. It creates a task queue like this:

    taskq_create_sysdc(
        "blabla",                            /* name */
        512,                                 /* nthreads */
        72,                                  /* minalloc */
        INT_MAX,                             /* maxalloc */
        my_kernel_process,                   /* proc */
        80,                                  /* dc */
        TASKQ_DYNAMIC | TASKQ_PREPOPULATE    /* flags */
    );

The benchmark starts 8 kernel LWPs. (BTW, it runs on an 8-thread Intel Core i7.) Each of these LWPs enqueues a million callbacks like this:

    static void
    callback(uint32_t *counter)
    {
        atomic_dec_32(counter);
    }

There are multiple counters, and pointers to those counters are distributed evenly among the callbacks.

Here comes a simple thought:

* Let's assume one CPU can run one billion instructions per second.
* Let's assume one callback (with all the overhead) could cost ten thousand instructions.
* Then each CPU could process 100,000 callbacks per second on an otherwise idle system...

Now the reality:

* I booted onnv_144 (DEBUG kernel), started the benchmark, and thought it would take just seconds.
* After 10 *minutes*, I started mdb to see what was going on. :-(
* All 8 benchmarking LWPs were *sleeping* in taskq_dispatch().
* All the taskq threads I looked at were sleeping as well, at least at the moment of observation.
* By watching the tq_executed counter periodically, I found that only about 110 tasks ran per second.
* The CPUs were spending 90% of their time in the kernel, which doesn't look like deep sleeping.
* The counters were decremented as expected, but it took ages...

What could be wrong? Where is the bottleneck? On one hand, most task queue threads appear to be sleeping, waiting for a job, and the 8 LWPs producing the callbacks also appear to be sleeping. On the other hand, all CPUs spend more than 90% of their time in the kernel... There must be a bottleneck somewhere. What could I try? Fewer task queue threads? A non-DEBUG kernel? How could this be diagnosed on a running system? (I can provide a full 'halt -d' dump.)
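The back-of-envelope estimate above can be written out explicitly. This is just a sketch of the arithmetic under the stated assumptions (1 GIPS per CPU, 10,000 instructions per callback, 8 CPUs, 8 million total tasks); the constant names are mine, not part of any kernel API:

    /* Expected vs. observed throughput, under the assumptions above. */
    #include <stdio.h>

    int
    main(void)
    {
        long instr_per_sec  = 1000000000L;   /* assumed: 1e9 instr/s per CPU */
        long instr_per_call = 10000L;        /* assumed: cost of one callback */
        long ncpus          = 8;
        long ntasks         = 8L * 1000000L; /* 8 LWPs x 1M callbacks */

        long per_cpu_rate = instr_per_sec / instr_per_call;  /* 100,000/s */
        long total_rate   = per_cpu_rate * ncpus;            /* 800,000/s */

        printf("expected: %ld callbacks/s -> %ld tasks done in ~%ld s\n",
            total_rate, ntasks, ntasks / total_rate);
        printf("observed: ~110 callbacks/s -> ~%ld s (~%ld hours)\n",
            ntasks / 110, ntasks / 110 / 3600);
        return (0);
    }

So the whole run should finish in roughly 10 seconds, while the observed rate of ~110 tasks/s would need on the order of 20 hours: a gap of nearly four orders of magnitude.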
I know that task queues are not designed for this type of "workload". But performing 110 atomic decrements per second on an 8-thread Nehalem CPU is just far below what I would expect. Any thoughts or hints would be very helpful. :-)

Andrej

With problems like this, I usually use DTrace to start figuring things out... Remember that you're using the System Duty Cycle scheduling class for this... the comment in http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/disp/sysdc.c#53 is worth reviewing...
Yes, I've already read that comment a number of times. :-) I use a kernel process that lives completely in the System Duty Cycle class (and includes a task queue) for most of my benchmarks and experiments. It gives me much more flexibility in situations where some (if not all) of the "experimental" threads/LWPs hang or spin indefinitely, since the whole userspace remains fully usable.

Furthermore, ZFS uses System Duty Cycle task queues as well, and AFAIK, ZFS *works* just fine. If it had a hard limit of about 100 callbacks per second, it would be a disaster as far as performance is concerned.

I read through the task queue source code, looking for a throttling mechanism, but there is probably nothing of that kind. Task queues have fault injection in debugging kernels, but that's nothing unexpected and nothing that could cause such poor performance. It may be possible that enqueueing millions of tasks too quickly triggers memory allocation throttling. But even if that happened, it would *only* slow down the dispatch side, not the operation of the whole task queue... So this problem still remains a mystery from my point of view.

Andrej
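For comparison, the raw cost of the callback body itself can be measured in userland. The following is a hedged sketch, not the kernel benchmark: it replaces atomic_dec_32() with the GCC/Clang __atomic builtins and pthreads, and measures only the atomic decrements, with none of the taskq dispatch overhead. All names and constants here are mine:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS   8        /* mirrors the 8 benchmarking LWPs */
    #define NCALLS     1000000  /* one million callbacks per thread */
    #define NCOUNTERS  16       /* several counters, hit round-robin */

    static uint32_t counters[NCOUNTERS];

    static void *
    worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NCALLS; i++)
            __atomic_fetch_sub(&counters[i % NCOUNTERS], 1,
                __ATOMIC_SEQ_CST);
        return (NULL);
    }

    int
    main(void)
    {
        pthread_t tids[NTHREADS];
        struct timespec t0, t1;

        /* Pre-load the counters so they end at exactly zero. */
        for (int i = 0; i < NCOUNTERS; i++)
            counters[i] = NTHREADS * (NCALLS / NCOUNTERS);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tids[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) +
            (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d atomic decrements in %.3f s (%.0f ops/s)\n",
            NTHREADS * NCALLS, secs, NTHREADS * NCALLS / secs);
        return (0);
    }

On any modern CPU this finishes in well under a second, which is the point: the work itself is nowhere near the bottleneck, so whatever limits the kernel benchmark to ~110 tasks/s must be in the dispatch/scheduling path, not in the callbacks.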
_______________________________________________
on-discuss mailing list
on-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/on-discuss