On 07/14/10 14:25, Andrej Podzimek wrote:
Hello,

I ran into some problems with dynamic task queue performance. So I ran a
benchmark inside the kernel. It creates a task queue like this:

taskq_create_sysdc(
    "blabla",                           /* name */
    512,                                /* nthreads */
    72,                                 /* minalloc */
    INT_MAX,                            /* maxalloc */
    my_kernel_process,                  /* proc */
    80,                                 /* dc */
    TASKQ_DYNAMIC | TASKQ_PREPOPULATE   /* flags */
);

The benchmark starts 8 kernel LWPs. (BTW, it runs on an 8-thread Intel
Core i7.) Each of these LWPs enqueues a million callbacks like this:

static void
callback(uint32_t *counter) {
    atomic_dec_32(counter);
}

There are multiple counters and pointers to those counters are
distributed evenly among the callbacks.

Here comes a simple thought:
* Let's assume one CPU can run one billion instructions per second.
* Let's assume one callback (with all the overhead) costs ten thousand
instructions.
* Then each CPU could process 100,000 callbacks per second on an
otherwise idle system, so the whole run (8 LWPs x 1,000,000 callbacks)
should take on the order of ten seconds...

Now the reality:
* I booted onnv_144 (DEBUG kernel), started the benchmark and thought it
would take just seconds.
* After 10 *minutes*, I started mdb to see what was going on. :-(
* All 8 benchmarking LWPs were *sleeping* in taskq_dispatch().
* All the taskq threads I looked at were sleeping as well, at least at
the moment of observation.
* By looking at the tq_executed counter periodically, I found out that
only about 110 tasks ran per second.
* The CPUs were spending 90% of their time in the kernel, which doesn't
look like deep sleeping.
* The counters were decremented as expected, but it took ages...
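The periodic tq_executed observation above can be repeated from the shell like this (the taskq address is a made-up placeholder; in practice it would come from the pointer returned by taskq_create_sysdc()):

```
echo 'ffffff01d1e5a000::print taskq_t tq_executed' | mdb -k
```

Running it twice a few seconds apart and subtracting gives the tasks-per-second figure quoted above.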

What could be wrong? Where is the bottleneck?

On one hand, most task queue threads appear to be sleeping, waiting for
a job, and the 8 threads producing the callbacks also appear to be sleeping.
On the other hand, all CPUs spend more than 90% of their time in the kernel...

There must be a bottleneck somewhere. What could I try? Fewer task queue
threads? Or a non-DEBUG kernel? How could this be diagnosed on a running
system? (I can provide a full 'halt -d' dump.)

I know that task queues are not designed for this type of "workload".
But performing 110 atomic decrements per second on an 8-thread Nehalem
CPU is just far below what I would expect.

Any thoughts or hints would be very helpful. :-)

Why are you on a guessing spree, rather than trying out a few
dtrace/*stat invocations?
- Run 'mpstat 1' before you start your benchmark.
- Wait for 'sys' to go up and then look at the other columns:
  - Have the mutex spins (smtx) gone up, or have the xcalls (xcal) gone up?
    - If so, try dtrace invocations like

          dtrace -n 'mutex_vector_enter:entry{ @[stack()] = count(); }'
          dtrace -n 'xc_serv:entry{ @[stack()] = count(); }'

      to find out which code path is causing it.

As yours is a kernel benchmark, I don't expect syscalls to go up.

If mpstat doesn't show anything, try vmstat and check whether swapping
activity has gone up [the mjf (major faults) column of the mpstat output
also gives a pointer in this direction] - could the system be looking to
create large pages and hence walking huge lists?

If you still don't see anything, use dtrace profiling to see which PCs
[and later on, stacks] show up frequently, and then drill down further.

If you still don't see anything, run

    echo '::stacks' | mdb -k >> /tmp/fb

a few times while your benchmark is running and look at the file
to see which stacks turn out to be interesting.

If the system becomes too slow to issue any of the above commands while
the benchmark is running, boot with kmdb. Once the system is up, drop to
the kmdb prompt and set a breakpoint in loadavg_update just before
starting your benchmark. The system will then drop to the kmdb prompt
every second, and you can look at the CPUs to see what they are running,
or run ::stacks to see what's happening.
-Surya

I think this could be a memory-related issue, but I'm not sure about it. I reduced the number of callbacks 1000 times and reran the benchmark. Based on an estimate of about 100 callbacks per second, it should have taken about 80 seconds. But this time the pathological case did not happen: the benchmark completed in a fraction of a second.

But there's still something wrong with this theory: When I looked at the memory statistics during the benchmark with the original high number of callbacks, there was *no* evidence of memory pressure:

::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     633467              2474   31%
ZFS File Data               40890               159    2%
Anon                        21287                83    1%
Exec and libs                1259                 4    0%
Page cache                   4564                17    0%
Free (cachelist)             4148                16    0%
Free (freelist)           1370935              5355   66%

Total                     2076550              8111
Physical                  2076549              8111

Those 2.5 gigabytes of kernel memory were allocated by the benchmark, so there is nothing inexplicable there. Page allocation can sleep when available memory drops below throttlefree, but with more than 5 GB still on the freelist that obviously did not happen. Perhaps the kernel has another memory allocation throttling mechanism unknown to me...
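One way to rule the page-level throttle in or out directly might be to watch for it with dtrace while the benchmark runs (a hedged suggestion: page_create_throttle is the function that implements the throttlefree check, and availability of its fbt probe is assumed):

```
dtrace -n 'fbt::page_create_throttle:entry{ @[stack()] = count(); }'
```

If that probe never fires during the benchmark, the page allocation throttle could be excluded as the cause.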

Andrej


_______________________________________________
on-discuss mailing list
on-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/on-discuss

