Oh, sorry, I didn't answer your question about the latency
observation with prstat. It's not high, just higher than
expected: approaching 1%, which is of course higher than what
I obtain when I coerce the situation by binding.
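
For reference, the numbers come from per-LWP microstate
accounting, via something along these lines (interval and pid
are illustrative, not my exact command):

  $ prstat -mL -p <pid> 5

The LAT column reports the percentage of time each LWP spent
waiting on a dispatch queue; unbound it approaches 1%, and it
drops to roughly zero when I bind the threads (pbind).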

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Eric C. Saxe
> Sent: Thursday, September 01, 2005 1:48 AM
> To: perf-discuss@opensolaris.org
> Subject: [perf-discuss] Re: Puzzling scheduler behavior
> 
> Hi David,
> 
> Since your v1280 system has NUMA characteristics, the bias 
> that you see for one of the boards may be a result of the 
> kernel trying to run your application's threads "close" to 
> where they have allocated their memory. We also generally try 
> to keep threads in the same process together, since they 
> tend to work on the same data. This might explain why one of 
> the boards is so much busier than the others. 
> 
> So yes, the interesting piece of this seems to be the higher 
> than expected run queue wait time (latency) as seen via 
> prstat -Lm. Even with the thread-to-board/memory affinity I 
> mentioned above, it generally shouldn't be the case that a 
> thread is willing to hang out on a run queue waiting for a 
> CPU in its "home" lgroup when it *could* actually run 
> immediately on a "remote" (off-board) CPU.
> Better to run remote than not at all, or so the saying goes :)
> 
> In the case where a thread is dispatched remotely because all 
> home CPUs are busy, the thread will try to migrate back home 
> the next time it comes through the dispatcher and finds it 
> can run immediately at home (either because there's an idle 
> CPU, or because one of the running threads is lower priority 
> than us, and we can preempt it). This migrating around means 
> that the thread will tend to spend more time waiting on run 
> queues, since it has to either wait for the idle() thread to 
> switch off, or for the lower priority thread it's able to 
> preempt to surrender the CPU. Either way, the thread 
> shouldn't have to wait long to get the CPU, but it will have 
> to wait a non-zero amount of time.
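> 
> If you'd like to quantify that wait directly, the DTrace 
> sched provider can measure run queue latency. This is just a 
> sketch along the lines of the stock example from the DTrace 
> guide, not something specific to your workload; you could add 
> a predicate on args[1]->pr_pid to narrow it to your process:
> 
>   # dtrace -n '
>   sched:::enqueue
>   {
>       /* timestamp when the thread goes onto a run queue */
>       ts[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
>   }
>   sched:::dequeue
>   /ts[args[0]->pr_lwpid, args[1]->pr_pid]/
>   {
>       /* nanoseconds spent waiting, quantized per CPU */
>       @[args[2]->cpu_id] = quantize(timestamp -
>           ts[args[0]->pr_lwpid, args[1]->pr_pid]);
>       ts[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
>   }'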
> 
> What does the prstat -Lm output look like exactly? Is it a 
> lot of wait time, or just more than you would expect?
> 
> By the way, just to be clear, when I say "board" what I 
> should be saying is lgroup (or locality group). This is the 
> Solaris abstraction for a set of CPU and memory resources 
> that are close to one another. On your system, it turns out 
> that the kernel creates an lgroup for each board, and each 
> thread is given an affinity for one of the lgroups, such 
> that it will try to run on the CPUs (and allocate memory) 
> from that group of resources.
> 
> One thing to look at here is whether or not the kernel could 
> be "overloading" a given lgroup. This would result in threads 
> tending to be less successful in getting CPU time (and/or 
> memory) in their home. At least for CPU time, you can see 
> this by looking at the number of migrations and where they 
> are taking place. If a thread isn't having much luck running 
> at home, it (and others sharing its home) will tend to 
> "ping-pong" between CPUs in and out of the home lgroup (we 
> refer to this as the "king of the hill" pathology). In your 
> mpstat output, I see many migrations on one of the boards, 
> and a good many on the other boards as well, so that might 
> well be happening here.
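> 
> (To put numbers on that: in mpstat output the migr column 
> counts thread migrations to another CPU and icsw counts 
> involuntary context switches, so something like
> 
>   $ mpstat 5
> 
> run alongside the workload shows whether the migrations 
> cluster on one board's CPUs or are spread across all of 
> them.)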
> 
> To get some additional observability into this issue, you 
> might want to take a look at some of the lgroup 
> observability/control tools we posted (available from the 
> performance community page). They allow you to do things like 
> query/set your application's lgroup affinities, find out 
> about the lgroups in the system and what resources they 
> contain, etc. Using them you might be able to confirm some of 
> the theory above. We would also *very* much like any feedback 
> you (or anyone else) would be willing to provide on the tools.
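> 
> As a rough illustration of what they do (the commands below 
> assume the posted prototypes match what later shipped as 
> lgrpinfo(1) and plgrp(1); the exact syntax on the community 
> page may differ):
> 
>   $ lgrpinfo                 # lgroup hierarchy, CPUs, memory
>   $ plgrp <pid>              # home lgroup of each LWP
>   $ plgrp -a all <pid>       # show affinity for each lgroup
>   $ plgrp -A 1/strong <pid>  # set strong affinity for lgroup 1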
> 
> In the short term, there's a tunable I can suggest you look 
> at that controls how hard the kernel tries to keep threads of 
> the same process together in the same lgroup. Tuning this 
> should result in your workload being spread out more 
> effectively than it currently seems to be. I'll post a 
> follow-up message tomorrow morning with these details, if 
> you'd like to try this.
> 
> In the medium-short term, we really need to implement a 
> mechanism to dynamically change a thread's lgroup affinity 
> when its home becomes overloaded. We presently don't have 
> this, as the mechanism that determines a thread's home lgroup 
> (and does the lgroup load balancing) is static in nature 
> (done at thread creation time). (It's implemented in 
> usr/src/uts/common/os/lgrp.c:lgrp_choose() if you'd like to 
> take a look at the source.) In terms of our NUMA/MPO 
> projects, this one is at the top of the ol' TODO list.
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
