Oh, sorry, I didn't answer your question about the latency observation with prstat. It's not high in absolute terms, just higher than expected: LAT approaches 1%, which is of course more than what I obtain when I coerce the situation by binding.
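For concreteness, here's roughly how I'm collecting those numbers; the
pid (12345) and CPU id below are placeholders for my actual process:

    # Per-LWP microstate accounting; the LAT column is the percentage
    # of time each LWP spent waiting on a run queue (5s intervals).
    prstat -mL -p 12345 5

    # "Coercing the situation": bind all of the process's LWPs to one
    # CPU (psrset(1M) would do the same per board).
    pbind -b 16 12345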
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Eric C. Saxe
> Sent: Thursday, September 01, 2005 1:48 AM
> To: perf-discuss@opensolaris.org
> Subject: [perf-discuss] Re: Puzzling scheduler behavior
>
> Hi David,
>
> Since your v1280 system has NUMA characteristics, the bias that you
> see for one of the boards may be a result of the kernel trying to
> run your application's threads "close" to where they have allocated
> their memory. We also generally try to keep threads in the same
> process together, since they generally tend to work on the same
> data. This might explain why one of the boards is so much busier
> than the others.
>
> So yes, the interesting piece of this seems to be the higher than
> expected run queue wait time (latency) as seen via prstat -Lm. Even
> with the thread-to-board/memory affinity I mentioned above, it
> generally shouldn't be the case that threads are willing to hang out
> on a run queue waiting for a CPU in their "home" when that thread
> *could* actually run immediately on a "remote" (off-board) CPU.
> Better to run remote than not at all, or so the saying goes :)
>
> In the case where a thread is dispatched remotely because all home
> CPUs are busy, the thread will try to migrate back home the next
> time it comes through the dispatcher and finds it can run
> immediately at home (either because there's an idle CPU, or because
> one of the running threads is lower priority than us, and we can
> preempt it). This migrating around means that the thread will tend
> to spend more time waiting on run queues, since it has to either
> wait for the idle() thread to switch off, or for the lower priority
> thread it's able to preempt to surrender the CPU. Either way, the
> thread shouldn't have to wait long to get the CPU, but it will have
> to wait a non-zero amount of time.
>
> What does the prstat -Lm output look like exactly? Is it a lot of
> wait time, or just more than you would expect?
>
> By the way, just to be clear, when I say "board" what I should be
> saying is lgroup (or locality group). This is the Solaris
> abstraction for a set of CPU and memory resources that are close to
> one another. On your system, it turns out that the kernel creates an
> lgroup for each board, and each thread is given an affinity for one
> of the lgroups, such that it will try to run on the CPUs (and
> allocate memory) from that group of resources.
>
> One thing to look at here is whether or not the kernel could be
> "overloading" a given lgroup. This would result in threads tending
> to be less successful in getting CPU time (and/or memory) in their
> home. At least for CPU time, you can see this by looking at the
> number of migrations and where they are taking place. If the thread
> isn't having much luck running at home, this means that it (and
> others sharing its home) will tend to "ping-pong" between CPUs in
> and out of the home lgroup (we refer to this as the "king of the
> hill" pathology). In your mpstat output, I see many migrations on
> one of the boards, and a good many on the other boards as well, so
> that might well be happening here.
>
> To get some additional observability into this issue, you might want
> to take a look at some of the lgroup observability/control tools we
> posted (available from the performance community page). They allow
> you to do things like query/set your application's lgroup affinity,
> find out about the lgroups in the system, and what resources they
> contain, etc.
> Using them you might be able to confirm some of my theory above. We
> would also *very* much like any feedback you (or anyone else) would
> be willing to provide on the tools.
>
> In the short term, there's a tunable I can suggest you take a look
> at that deals with how hard the kernel tries to keep threads of the
> same process together in the same lgroup. Tuning this should result
> in your workload being spread out more effectively than it currently
> seems to be. I'll post a follow-up message tomorrow morning with
> these details, if you'd like to try this.
>
> In the medium-short term, we really need to implement a mechanism to
> dynamically change a thread's lgroup affinity when its home becomes
> overloaded. We presently don't have this, as the mechanism that
> determines a thread's home lgroup (and does the lgroup load
> balancing) is static in nature (done at thread creation time).
> (Implemented in usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd
> like to take a look at the source.) In terms of our NUMA/MPO
> projects, this one is at the top of the ol' TODO list.
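Thanks, I'll pull down the lgroup tools from the performance community
page and report back with feedback. For anyone following along, my (as
yet untested) reading of the interfaces is roughly this; the lgroup
and pid numbers below are made up:

    # List the lgroups in the system and the CPUs/memory each contains.
    lgrpinfo

    # Show the home lgroup of each LWP in the process.
    plgrp 12345

    # Experiments: set a strong affinity for lgroup 2, or re-home the
    # LWPs there outright.
    plgrp -A 2/strong 12345
    plgrp -H 2 12345

If I've read the usage right, that last invocation also looks like a
manual stopgap for the static-home problem you describe, at least
until lgrp_choose() learns to rebalance on the fly.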
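To put numbers on the "king of the hill" ping-pong, my plan is to
compare each LWP's home lgroup against the lgroup of the CPU it
actually lands on. Something like the DTrace one-liner below should do
it, assuming the curcpu built-in and the pr_lgrp lwpsinfo field are
present in this build (untested here so far; 12345 is again a
placeholder pid):

    # Count dispatches where an LWP of pid 12345 comes on-CPU outside
    # its home lgroup, keyed by thread id and the CPU it landed on.
    dtrace -n 'sched:::on-cpu
        /pid == 12345 && curlwpsinfo->pr_lgrp != curcpu->cpu_lgrp/
        { @[tid, cpu] = count(); }'

A large count here, relative to total dispatches, would support the
overloaded-home-lgroup theory.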