This could be a shared-resource or thread-management issue in that
user's application. It certainly sounds like one:

* The app, running ten threads, was getting 1000 % CPU, i.e. the
equivalent of 10 CPUs.
* After "something" was changed in the app, the same ten threads only
get 300 %, i.e. 3 CPUs.

Some speculation on what could have gone wrong:

* The changed simulation app is now very I/O heavy and spends most of
its time waiting for I/O. But then the "wa" percentage in top and
vmstat would be high, and it is reported as 0 %. So this is probably
not the case.
* Or the app's own thread management or synchronisation is messed up.
Threads that keep blocking on each other and being woken up again
could explain the large number of context switches, couldn't they?
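
One way to look into the second guess is to read the per-thread
context-switch counters that Linux exposes in
/proc/<pid>/task/<tid>/status. The sketch below is only an
illustration: the function names are my own, and it assumes a Linux
/proc with the voluntary_ctxt_switches and nonvoluntary_ctxt_switches
fields. A voluntary count that grows quickly would suggest threads that
keep going to sleep (on locks or waits) rather than being preempted.

```python
import os


def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from a /proc status file."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
            counts[key] = int(value)
    return counts


def per_thread_switches(pid):
    """Map each thread ID of `pid` to its context-switch counters."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/status") as fh:
                result[int(tid)] = parse_ctxt_switches(fh.read())
        except FileNotFoundError:
            continue  # the thread exited while we were reading
    return result
```

Calling per_thread_switches() twice, a second apart, on the
simulation's PID and comparing the two results gives a rough
per-thread switch rate.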

To find out whether the simulation itself is somehow defective, just
throw in a few more single-threaded apps that keep the CPUs busy
(non-optimised endless loops will do). These apps take up virtually
no memory, do no I/O and cause no interrupts. You should see "user"
go up pretty sharply (by approximately 100 % of one CPU per app, which
is roughly 8 % on top's summary line for this 12-core machine, as long
as you have some free CPUs left to use). If that is the case, the
machine can still deliver full CPU time to well-behaved processes, and
the simulation app is possibly outsmarting itself.
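
A minimal sketch of such a CPU burner, in Python rather than a
compiled endless loop. The function name burn and the time limit are
my own additions, so that the test load stops by itself instead of
having to be killed:

```python
import time


def burn(seconds):
    """Spin in a tight loop for `seconds`; return the number of iterations."""
    deadline = time.monotonic() + seconds
    count = 0
    while time.monotonic() < deadline:
        count += 1  # pure computation: no I/O, no sleeping
    return count
```

Saved under a hypothetical name such as burn.py, several copies can be
started in the background, for example with
python3 -c "from burn import burn; burn(600)" &
and each copy should occupy about one CPU while it runs.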

(And from experience - when computer users in an academic environment
are complaining about performance issues like things being slower than
they should, or continuously running out of memory, ask them what kind
of QA process their software has been subjected to... thinking of
stories like spending $$$ on bigger RAM for someone's simulations,
only to find out later they had a memory leak in their software.)

Kind regards,

Helmut.

On 26/10/2012, Peter Glassenbury(UoC)
<[email protected]> wrote:
> Thanks for the info.. installed htop .. looks nice. no database.. and the
> iostats and nfsiostats
> basically said 0% usage.
>
> They ARE running multithreaded simulations.. and this machine has
> been running almost 6 months (so would it catch up with a leap second
> issue?).
> There are no PROCESSES causing problems -- they are just not
> running as fast as they should be..
> As the user said....
>
> I have written my simulation program to run with several threads.
> When I have started the program the first time on the above machine,
> I have started it with ten threads and got a utilization for my
> process that is close to 1000%. After I re-wrote my program and
> started it again with the same configuration, I am only
> getting about 300% utilization, so I cannot utilize the cores anymore.
> ...While I have changed my code, the stuff that deals with
> thread creation and control has not changed.
>
> Normally I would take statements like this with a grain of salt but the
> vmstat details look to be screwy and back up his symptoms.
>
> Machine has 12 cores
>
> Looking at /proc/interrupts, these lines change markedly
>      Local timer interrupts - about 1 or 2 thousand per sec on each CPU
>      Rescheduling interrupts - about 6 or 7 thousand per sec on each CPU
>      TLB shootdowns - about 1 or 2 thousand per sec on each CPU
> some of the  PCI-MSI-edge change .. but not by much.. everything else is
> pretty static.
>
>
> Thanks for the offer but initially I'm trying to improve a gap in my
> diagnostic knowledge. :-) and
> maybe a FYI for the rest of Linux-users?
>
> Pete
> Re: WTF -- Fedora on a server is because .. Our labs all run Fedora 16.
> This server is to run lab software while people are running on Windows
> or from home or from laptops.. There is sometimes some method to our
> madness :-) . Proper servers run RHEL round here
>
>
> On 26/10/12 11:22, Steve Holdoway wrote:
>> Although you say it's a compute not an io server, those figures
>> initially scream IO problems... all that time in sys and a high load
>> average usually indicate that. Have you got a poorly configured database
>> on there by any chance?
>>
>> Context switches and interrupts are pretty high... could be the leap
>> second screwed up locking if you're running multithreaded apps, although
>> I don't think there's been one for a few months now?
>>
>> What info do you have in /proc/interrupts - should give an idea of what
>> resource is being hit? Does (h)top identify the processes causing the
>> problem, strace it?
>>
>> Happy to have a look if you wish... it's sort of what I do for a living.
>>
>> Cheers,
>>
>> Steve
>> ( the real WTF is fedora on a server (: )
>>
>> On Fri, 2012-10-26 at 10:40 +1300, Peter Glassenbury(UoC) wrote:
>>> Hi all,
>>>
>>> I have had a few years sorting out hardware performance on
>>> some of our loaded linux compute servers. This one has me perplexed..
>>> It is not too much of a worry but I would like to know why
>>> this is happening .. and if I should be worried :-)
>>>
>>> Machine (Fedora 16) is not running at peak performance but has done
>>> so previous to last week  or so.. So there is possibly something about
>>> the mix of the current
>>> jobs that is causing it.
>>>
>>> top - 10:13:09 up 164 days,  1:40, 13 users,  load average: 12.31, 14.60, 15.11
>>>
>>> Tasks: 305 total,   3 running, 299 sleeping,   3 stopped,   0 zombie
>>>
>>> Cpu(s): 14.4%us, 22.2%sy,  0.0%ni, 59.9%id,  0.0%wa,  2.0%hi,  1.4%si,  0.0%st
>>>
>>> Mem:  65979204k total, 37972660k used, 28006544k free,   617516k buffers
>>>
>>> Swap: 16777212k total,   101216k used, 16675996k free, 32201980k cached
>>>
>>> So heaps of memory, swap. CPUs are sitting idle a lot of the time
>>> (They shouldn't be on this compute server)
>>> nfsiostat and iostat show next to nothing happening
>>> (it's a compute server not an io server)
>>>
>>> Vmstat has the weird bit that I haven't seen before. system interrupts
>>> and context switches are
>>> through the roof for anything I have seen.
>>>
>>> $ vmstat 1
>>> procs -----------memory-------------- ---swap-- -----io---- --system-- -----cpu-----
>>>  r  b   swpd     free   buff    cache   si   so    bi    bo    in    cs us sy id wa st
>>>  2  0 101216 28003040 617520 32203196    0    0     0    68 68295 71578 16 25 59  0  0
>>> 12  0 101216 28004640 617520 32203196    0    0     0     0 71740 72877 14 24 62  0  0
>>>  6  0 101216 28001972 617520 32203196    0    0     0     0 70366 72381 14 25 61  0  0
>>>  4  0 101216 27997920 617520 32203196    0    0     0     0 67163 68348 13 25 62  0  0
>>>
>>>
>>> Googling found something about leap seconds and restarting ntp.. which I
>>> have done.
>>> Anyone have ideas or suggestions on what to look at ?
>>> I would prefer not to do the "three finger salute" on this machine
>>> as some jobs have been running for weeks.
>>>
>>> Cheers
>>> Pete
>>>
>>
>>
>> _______________________________________________
>> Linux-users mailing list
>> [email protected]
>> http://lists.canterbury.ac.nz/mailman/listinfo/linux-users
>
>
> --
> Peter Glassenbury              Computer Science & Software Engineering
> [email protected]     University of Canterbury
> +64 3 3667001 ext 7762
>
>