On Sunday, 6 November 2005 22:13, Yedidyah Bar-David wrote:
> On Sun, Nov 06, 2005 at 01:05:39PM +0200, Oded Arbel wrote:
> > Hi list.
> >
> > I have a problem with a P4 (hyper-threaded) powered server. It
> > constantly has a load average of 2.something, while looking with
> > top I don't see any process actually taking all that CPU resource.
> > The server is mostly used to run Nagios monitor and some Java
> > daemons. Tomcat is running taking about 1.2GB of virtual, which is
> > about 60% of all memory, but it sees absolutely no usage and uses
> > less than 5% real memory. Two other Java services and MySQL
> > together grab another 600MB of virtual and everything else is
> > mostly scripts that use negligible amounts of VIRT, RES and CPU.
>
> Maybe one of the scripts/daemons has a loop of quite short delays?
> Testing this isn't very easy - you can either strace some of the
> suspects or try something like syscalltrack.
Thanks
The server doesn't run a lot of processes (or shouldn't anyway). I
removed everything I didn't absolutely need and straced all the other
non-kernel processes, and found nothing interesting.
Then I started removing processes until I got to the culprit - the Java
program that implements the services provided by the server.
Of course I did the testing in off-peak hours so there would be no
disruption of service to our client. At that time there was absolutely
no activity whatsoever on any of the services, so the only thing the
Java program was supposed to do was call wait() (a Java thread
synchronization call) every second, which was indeed verified by
stracing the Java process, and here is the output:
futex(0x4d907b60, FUTEX_WAIT, 233, {0, 265545000}) = -1 ETIMEDOUT (Connection timed out)
futex(0x805d33c, FUTEX_WAKE, 1) = 0
gettimeofday({1131445296, 417683}, NULL) = 0
clock_gettime(0, {1131445296, 417799000}) = 0
futex(0x4d907b60, FUTEX_WAIT, 234, {0, 499884000}) = -1 ETIMEDOUT (Connection timed out)
futex(0x805d33c, FUTEX_WAKE, 1) = 0
gettimeofday({1131445296, 918529}, NULL) = 0
clock_gettime(0, {1131445296, 918646000}) = 0
futex(0x4d907b60, FUTEX_WAIT, 235, {0, 499883000}) = -1 ETIMEDOUT (Connection timed out)
futex(0x805d33c, FUTEX_WAKE, 1) = 0
gettimeofday({1131445297, 419424}, NULL) = 0
clock_gettime(0, {1131445297, 419540000}) = 0
futex(0x4d907b60, FUTEX_WAIT, 236, {0, 499884000}) = -1 ETIMEDOUT (Connection timed out)
futex(0x805d33c, FUTEX_WAKE, 1) = 0
gettimeofday({1131445297, 920319}, NULL) = 0
clock_gettime(0, {1131445297, 920436000}) = 0
...
and so on and so forth
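
One way to read the trace above: each gettimeofday() timestamp marks a
wake-up, so the differences between them show how often the traced thread
actually runs. The following is only a minimal Python sketch of that
arithmetic, using the four timestamps from the trace. The helper names
are made up for illustration, and the result says something only about
the one thread that was traced.

```python
import re

# The four gettimeofday() lines copied from the strace output above.
TRACE = """\
gettimeofday({1131445296, 417683}, NULL) = 0
gettimeofday({1131445296, 918529}, NULL) = 0
gettimeofday({1131445297, 419424}, NULL) = 0
gettimeofday({1131445297, 920319}, NULL) = 0
"""

# Captures the seconds and microseconds arguments of each call.
PATTERN = re.compile(r"gettimeofday\(\{(\d+), (\d+)\}")


def wakeup_times_us(trace):
    """Return each gettimeofday() timestamp as integer microseconds."""
    return [int(sec) * 1_000_000 + int(usec)
            for sec, usec in PATTERN.findall(trace)]


def intervals_ms(times_us):
    """Return the gaps between consecutive timestamps in milliseconds."""
    return [(b - a) / 1000 for a, b in zip(times_us, times_us[1:])]


gaps = intervals_ms(wakeup_times_us(TRACE))
print(gaps)
```

The gaps come out at roughly 500 ms each, i.e. about two wake-ups per
second for this thread, which by itself does not look like a busy loop.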
Nonetheless, stopping the program resulted in an immediate drop of CPU
usage to just about 0% and starting it again put it back to 100%.
Someone suggested that these are two separate issues:
1) procps is broken and misreports CPU usage of processes.
- I'm currently using procps 3.1.15 on a Mandrake 10.0 official vanilla
kernel 2.6.3-7mdk-p3-smp-64GB. I looked through the procps changelog and
didn't find anything that sounds related between the above-mentioned
version and the current one.
2) Java is eating up all CPU power.
I'm using Sun's J2RE 1.4.2_06 packaged by Mandriva. From the strace of
the process I can't see how it can consume all the CPU. Another Java
program running on the same machine, which runs a much smaller
subset of the exact same code as the offending process - and with a
similar strace - does not show this behavior.
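
To separate the two suspicions above, the CPU time of a process can be
read straight from /proc/<pid>/stat and compared with what top shows.
What follows is a rough Python sketch, not a replacement for procps: the
function names are hypothetical, and since procps reads the same /proc
files, agreement would only rule out a display or arithmetic problem in
procps, not a problem in the kernel's own accounting.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the last ')' before counting.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def cpu_percent(pid, interval=1.0):
    """Sample /proc/<pid>/stat twice; return usage as a percent of one CPU."""
    ticks_per_sec = os.sysconf("SC_CLK_TCK")

    def read_ticks():
        with open(f"/proc/{pid}/stat") as f:
            return parse_cpu_ticks(f.read())

    before = read_ticks()
    time.sleep(interval)
    after = read_ticks()
    return 100.0 * (after - before) / ticks_per_sec / interval


if __name__ == "__main__" and os.path.exists("/proc/self/stat"):
    # Example: measure this script itself; pass the Java PID instead.
    print(cpu_percent(os.getpid(), interval=0.5))
```

On 2.6 kernels the per-thread files under /proc/<pid>/task/<tid>/stat
use the same layout, so the same parsing could be pointed at individual
threads of the Java process.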
I suspect the futex() calls from the above trace - AFAIK the name stands
for "fast userspace mutex", but I don't understand enough about them to
guess why the process behaves that way.
--
Oded
::..
"The truth is more important than the facts."
-- Frank Lloyd Wright