Thanks for the info. I installed htop; it looks nice. There is no database, and iostat and nfsiostat basically show 0% usage.

They ARE running multithreaded simulations, and this machine has
been up almost 6 months (so could it have been hit by the leap second issue?).
No individual PROCESSES are causing problems; they are just not
running as fast as they should be.
As the user said:

I have written my simulation program to run with several threads.
When I have started the program the first time on the above machine,
I have started it with ten threads and got a utilization for my
process that is close to 1000%. After I re-wrote my program and
started it again with the same configuration, I am only
getting about 300% utilization, so I cannot utilize the cores anymore.
...While I have changed my code, the stuff that deals with
thread creation and control has not changed.

Normally I would take statements like this with a grain of salt, but the
vmstat details look screwy and back up his symptoms.

Machine has 12 cores

Looking at /proc/interrupts, these lines change markedly:
    Local timer interrupts - about 1 or 2 thousand per sec on each CPU
    Rescheduling interrupts - about 6 or 7 thousand per sec on each CPU
    TLB shootdowns - about 1 or 2 thousand per sec on each CPU
Some of the PCI-MSI-edge lines change, but not by much. Everything else is
pretty static.
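
The per-second figures above can be reproduced by reading /proc/interrupts twice and dividing the difference by the interval. Below is a minimal Python sketch, assuming the usual layout (a header row of CPU names, then one row per interrupt source with one count per CPU). The function names parse_interrupts and rates are made up for illustration.

```python
def parse_interrupts(text):
    """Return {label: [count per CPU]} from the text of /proc/interrupts."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "label: counts" row
        counts = []
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break  # reached the description text
            counts.append(int(field))
        table[label.strip()] = counts
    return table


def rates(before, after, seconds):
    """Per-CPU interrupts per second between two parsed snapshots."""
    out = {}
    for label, new in after.items():
        old = before.get(label)
        if old is None or len(old) != len(new):
            continue  # source appeared or changed shape between samples
        out[label] = [(n - o) / seconds for o, n in zip(old, new)]
    return out
```

To use it, read /proc/interrupts, sleep for N seconds, read it again, then call rates(first, second, N). On x86 kernels the three lines listed above appear under the labels LOC, RES and TLB.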


Thanks for the offer, but initially I'm trying to fill a gap in my diagnostic knowledge :-) and maybe provide an FYI for the rest of Linux-users?

Pete
Re: WTF -- Fedora is on this server because our labs all run Fedora 16. This server is there to run lab software for people who are on Windows, at home, or on laptops. There is sometimes some method to our madness :-). Proper servers run RHEL round here.


On 26/10/12 11:22, Steve Holdoway wrote:
Although you say it's a compute not an io server, those figures
initially scream IO problems... all that time in sys and a high load
average usually indicate that. Have you got a poorly configured database
on there by any chance?

Context switches and interrupts are pretty high... could be the leap
second screwed up locking if you're running multithreaded apps, although
I don't think there's been one for a few months now?

What info do you have in /proc/interrupts? It should give an idea of what
resource is being hit. Does (h)top identify the processes causing the
problem? Can you strace them?
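
One way to see which tasks are doing the context switching is the pair of counters voluntary_ctxt_switches and nonvoluntary_ctxt_switches in each /proc/<pid>/task/<tid>/status file. The rough Python sketch below takes two snapshots and ranks tasks by the difference. As far as I know these counters are per thread, which is why it walks the task directories rather than only /proc/<pid>/status. The helper names are hypothetical.

```python
import glob
import os


def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from a status file's text."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    result = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            result[key] = int(value)  # int() ignores the surrounding tab
    return result


def snapshot(proc_root="/proc"):
    """Return {(pid, tid): total context switches} for every readable thread."""
    snap = {}
    pattern = os.path.join(proc_root, "[0-9]*", "task", "[0-9]*", "status")
    for path in glob.glob(pattern):
        parts = path.split(os.sep)
        pid, tid = int(parts[-4]), int(parts[-2])
        try:
            with open(path) as f:
                counts = parse_ctxt_switches(f.read())
        except OSError:
            continue  # thread exited while we were scanning
        snap[(pid, tid)] = sum(counts.values())
    return snap


def busiest(before, after, top=5):
    """Tasks with the largest increase in context switches, highest first."""
    deltas = {k: after[k] - before[k] for k in after if k in before}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Calling snapshot(), sleeping a few seconds, calling it again and passing both results to busiest() gives a short list of (pid, tid) pairs to strace first.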

Happy to have a look if you wish... it's sort of what I do for a living.

Cheers,

Steve
( the real WTF is fedora on a server (: )

On Fri, 2012-10-26 at 10:40 +1300, Peter Glassenbury(UoC) wrote:
Hi all,

I have had a few years sorting out hardware performance on
some of our loaded linux compute servers. This one has me perplexed..
It is not too much of a worry but I would like to know why
this is happening .. and if I should be worried :-)

Machine (Fedora 16) is not running at peak performance, but it was until
a week or so ago. So there is possibly something about the mix of the
current jobs that is causing it.

top - 10:13:09 up 164 days,  1:40, 13 users,  load average: 12.31, 14.60, 15.11

Tasks: 305 total,   3 running, 299 sleeping,   3 stopped,   0 zombie

Cpu(s): 14.4%us, 22.2%sy,  0.0%ni, 59.9%id,  0.0%wa,  2.0%hi,  1.4%si,  0.0%st

Mem:  65979204k total, 37972660k used, 28006544k free,   617516k buffers

Swap: 16777212k total,   101216k used, 16675996k free, 32201980k cached

So heaps of memory and swap. CPUs are sitting idle a lot of the time
(they shouldn't be on this compute server).
nfsiostat and iostat show next to nothing happening
(it's a compute server, not an IO server).

Vmstat has the weird bit that I haven't seen before: system interrupts and
context switches are through the roof compared with anything I have seen.

$ vmstat 1
procs ------------memory------------- ---swap-- -----io---- ---system-- -----cpu------
 r  b   swpd     free   buff    cache   si   so    bi    bo    in    cs us sy id wa st
 2  0 101216 28003040 617520 32203196    0    0     0    68 68295 71578 16 25 59  0  0
12  0 101216 28004640 617520 32203196    0    0     0     0 71740 72877 14 24 62  0  0
 6  0 101216 28001972 617520 32203196    0    0     0     0 70366 72381 14 25 61  0  0
 4  0 101216 27997920 617520 32203196    0    0     0     0 67163 68348 13 25 62  0  0
Googling found something about leap seconds and restarting ntp, which I
have done.
Anyone have ideas or suggestions on what to look at?
I would prefer not to do the "three finger salute" on this machine
as some jobs have been running for weeks.

Cheers
Pete



_______________________________________________
Linux-users mailing list
[email protected]
http://lists.canterbury.ac.nz/mailman/listinfo/linux-users


--
Peter Glassenbury              Computer Science & Software Engineering
[email protected]     University of Canterbury
+64 3 3667001 ext 7762
