Hi David,
I am certainly no expert, but this looks to me like the classic NFS
symptoms you see when the server gets overloaded, or when a disk or the
network gets flaky.
If it were me, I'd try to get the class to do more local i/o (if
possible). Perhaps a scratch area on the local disk would solve the
problem.
I think you could reproduce the problem by writing a test script that
does heavy i/o to the network folders, then running it on more and more
machines and watching the i/o throughput approach zero as the machines
hang while waiting for NFS.
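A minimal sketch of such a test script, assuming Python 3 is available on the clients. The function name, the file prefix and the default sizes are made up for illustration; nothing here comes from the original setup. It writes a temporary file into the given directory, forces the data out with fsync, and reports the write throughput in MB/s.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, total_bytes=64 * 1024 * 1024,
                             chunk_size=1024 * 1024):
    """Write total_bytes to a temporary file in directory; return MB/s.

    Hypothetical helper: the sizes are arbitrary defaults, not tuned values.
    """
    chunk = b"\0" * chunk_size
    written = 0
    fd, path = tempfile.mkstemp(dir=directory, prefix="nfs-io-test-")
    start = time.monotonic()
    try:
        with os.fdopen(fd, "wb") as f:
            while written < total_bytes:
                n = min(chunk_size, total_bytes - written)
                f.write(chunk[:n])
                written += n
            f.flush()
            os.fsync(f.fileno())  # push the data to the server, not just the page cache
        elapsed = time.monotonic() - start
    finally:
        os.unlink(path)  # always clean up the test file
    return written / (1024 * 1024) / max(elapsed, 1e-9)
```

Calling it in a loop against an NFS-mounted directory on one machine, then on several at once, should show whether the reported MB/s collapses as clients are added.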
Again, I'm no expert, so feel free to ignore me.
Joe
On 11/29/2012 10:49 AM, David Fitzgerald wrote:
Last night during class time I had a chance to check some of the machines with
the frozen displays, and I am not sure what to make of what I found. Running
'lsof -p $PID' (with PID being 5044) on one of the affected machines gave
this, which doesn't tell me much:
10.10.10 5044 root cwd DIR 8,7 4096 2 /
10.10.10 5044 root rtd DIR 8,7 4096 2 /
10.10.10 5044 root txt unknown /proc/5044/exe
I also ran pstree and I will put that output below, but I think I may be
barking up the wrong tree. While some of my clients were freezing up, I saw
that my NFS server was getting very high 'top' loads. Fortunately I have
sysstat running on the server and after class 'sar -u' showed that %iowait went
from less than 1 before class to a high of 53 after class began, and stayed
high until class ended. Here is the relevant 'chunk' of the sar -u output:
05:20:01 PM all 0.03 0.00 0.07 0.17 0.00 99.73
05:30:01 PM all 0.03 0.00 0.03 0.11 0.00 99.83
05:40:01 PM all 0.18 0.00 0.50 1.88 0.00 97.44
05:50:01 PM all 0.16 0.00 1.12 6.93 0.00 91.78
06:00:01 PM all 0.73 0.00 5.23 32.61 0.00 61.43
06:10:01 PM all 0.77 0.00 6.55 53.67 0.00 39.01
06:20:01 PM all 0.13 0.00 4.81 27.81 0.00 67.25
06:30:01 PM all 0.13 0.00 6.69 21.71 0.00 71.47
06:40:01 PM all 0.11 0.00 3.47 33.34 0.00 63.08
06:50:01 PM all 0.11 0.00 3.20 31.02 0.00 65.67
07:00:01 PM all 0.24 0.00 3.93 30.79 0.00 65.05
07:10:01 PM all 0.16 0.00 3.63 20.51 0.00 75.71
07:20:01 PM all 0.18 0.00 5.23 1.45 0.00 93.13
07:30:01 PM all 0.10 0.00 5.72 0.70 0.00 93.48
Average: all 0.06 0.01 0.46 2.13 0.00 97.34
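A small sketch for pulling the worst sample out of output like the above, assuming the usual 'sar -u' column order (time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle). The function name is made up for illustration, and the sample rows are copied from the report above.

```python
def peak_iowait(sar_lines):
    """Return (timestamp, %iowait) for the sample with the highest %iowait.

    Assumes 12-hour timestamps and the column order:
    HH:MM:SS AM/PM CPU %user %nice %system %iowait %steal %idle
    """
    best = None
    for line in sar_lines:
        fields = line.split()
        # Skip header rows, blank lines and the "Average:" row.
        if len(fields) != 9 or fields[2] != "all":
            continue
        try:
            iowait = float(fields[6])
        except ValueError:
            continue
        stamp = fields[0] + " " + fields[1]
        if best is None or iowait > best[1]:
            best = (stamp, iowait)
    return best


sample = [
    "05:50:01 PM all 0.16 0.00 1.12 6.93 0.00 91.78",
    "06:00:01 PM all 0.73 0.00 5.23 32.61 0.00 61.43",
    "06:10:01 PM all 0.77 0.00 6.55 53.67 0.00 39.01",
    "Average: all 0.06 0.01 0.46 2.13 0.00 97.34",
]
print(peak_iowait(sample))  # ('06:10:01 PM', 53.67)
```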
The NFS server is a virtual machine running in ESXi 4.1, and VMware Tools IS
installed. Could this be slow disk access, and thus a VMware misconfiguration?
I hate to admit it, but I am at a loss.
I can run other sar reports on yesterday's (Wednesday's) data if anyone thinks
there may be something in there to help.
For what it's worth, here is the output from pstree from one of the affected
clients, and I do NOT see the PID that I was looking for:
init(1)-+-NetworkManager(1782)-+-dhclient(1808)
| `-{NetworkManager}(1809)
|-abrtd(2341)
|-acpid(2039)
|-anacron(3615)
|-atd(2413)
|-atieventsd(2421)---authatieventsd.(4134)
|-auditd(1547)-+-audispd(1549)-+-sedispatch(1550)
| | `-{audispd}(1551)
| `-{auditd}(1548)
|-automount(2134)-+-{automount}(2135)
| |-{automount}(2136)
| |-{automount}(2139)
| |-{automount}(2142)
| |-{automount}(2143)
| `-{automount}(2144)
|-avahi-daemon(1794)---avahi-daemon(1795)
|-bonobo-activati(4549)---{bonobo-activat}(4550)
|-cachefilesd(1597)
|-certmonger(2435)
|-clock-applet(4644)
|-console-kit-dae(2521)-+-{console-kit-da}(2522)
| |-{console-kit-da}(2523)
| |-{console-kit-da}(2524)
| |-{console-kit-da}(2525)
| |-{console-kit-da}(2526)
| |-{console-kit-da}(2527)
| |-{console-kit-da}(2528)
| |-{console-kit-da}(2529)
| |-{console-kit-da}(2530)
| |-{console-kit-da}(2531)
| |-{console-kit-da}(2532)
| |-{console-kit-da}(2533)
| |-{console-kit-da}(2534)
| |-{console-kit-da}(2535)
| |-{console-kit-da}(2536)
| |-{console-kit-da}(2537)
| |-{console-kit-da}(2538)
| |-{console-kit-da}(2539)
| |-{console-kit-da}(2540)
| |-{console-kit-da}(2541)
| |-{console-kit-da}(2542)
| |-{console-kit-da}(2543)
| |-{console-kit-da}(2544)
| |-{console-kit-da}(2545)
| |-{console-kit-da}(2546)
| |-{console-kit-da}(2547)
| |-{console-kit-da}(2548)
| |-{console-kit-da}(2549)
| |-{console-kit-da}(2550)
| |-{console-kit-da}(2551)
| |-{console-kit-da}(2552)
| |-{console-kit-da}(2553)
| |-{console-kit-da}(2554)
| |-{console-kit-da}(2555)
| |-{console-kit-da}(2556)
| |-{console-kit-da}(2557)
| |-{console-kit-da}(2558)
| |-{console-kit-da}(2559)
| |-{console-kit-da}(2560)
| |-{console-kit-da}(2561)
| |-{console-kit-da}(2562)
| |-{console-kit-da}(2563)
| |-{console-kit-da}(2564)
| |-{console-kit-da}(2565)
| |-{console-kit-da}(2566)
| |-{console-kit-da}(2567)
| |-{console-kit-da}(2568)
| |-{console-kit-da}(2569)
| |-{console-kit-da}(2570)
| |-{console-kit-da}(2571)
| |-{console-kit-da}(2572)
| |-{console-kit-da}(2573)
| |-{console-kit-da}(2574)
| |-{console-kit-da}(2575)
| |-{console-kit-da}(2576)
| |-{console-kit-da}(2577)
| |-{console-kit-da}(2578)
| |-{console-kit-da}(2579)
| |-{console-kit-da}(2580)
| |-{console-kit-da}(2581)
| |-{console-kit-da}(2582)
| |-{console-kit-da}(2583)
| `-{console-kit-da}(2585)
|-crond(2402)
|-cupsd(1955)
|-dbus-daemon(1772)
|-dbus-daemon(2883)
|-dbus-launch(2591)
|-dbus-launch(2882)
|-devkit-power-da(2602)
|-fcoemon(1760)
|-firefox(4968)
|-gconf-im-settin(4534)
|-gconfd-2(3175)
|-gdm-binary(2449)---gdm-simple-slav(2490)-+-Xorg(2492)
| `-gdm-session-wor(2671)---tcsh(2849)---gnome-session(4148)-+-bluetooth-apple(436+
| |-gdu-notificatio(432+
| |-gnome-panel(4253)
| |-gnome-power-man(434+
| |-gnome-volume-co(432+
| |-gpk-update-icon(430+
| |-krb5-auth-dialo(435+
| |-metacity(4244)
| |-nautilus(4276)
| |-nm-applet(4342)
| |-polkit-gnome-au(432+
| |-python(4294)
| `-{gnome-session}(422+
|-gdm-user-switch(4640)
|-gedit(4779)-+-{gedit}(4894)
| |-{gedit}(5037)
| |-{gedit}(5038)
| `-{gedit}(5039)
|-gnome-keyring-d(2831)-+-{gnome-keyring-}(2832)
| `-{gnome-keyring-}(4237)
|-gnome-screensav(4665)
|-gnome-settings-(4235)---{gnome-settings}(4248)
|-gnote(4635)
|-gvfs-afc-volume(4573)---{gvfs-afc-volum}(4574)
|-gvfs-gdu-volume(4569)
|-gvfs-gphoto2-vo(4571)
|-gvfsd(3168)
|-gvfsd-burn(4754)
|-gvfsd-metadata(4794)
|-gvfsd-trash(4656)
|-hald(2048)---hald-runner(2049)-+-hald-addon-acpi(2096)
| |-hald-addon-inpu(2088)
| `-hald-addon-stor(2097)
|-im-settings-dae(4371)
|-lldpad(1734)
|-master(2332)-+-pickup(2347)
| `-qmgr(2348)
|-mingetty(2454)
|-mingetty(2456)
|-mingetty(2458)
|-mingetty(2460)
|-mingetty(2462)
|-modem-manager(1789)
|-notification-ar(4642)
|-ntpd(2249)
|-pcscd(2114)---{pcscd}(2129)
|-polkitd(2647)
|-pulseaudio(4331)-+-gconf-helper(4563)
| |-{pulseaudio}(4535)
| `-{pulseaudio}(4539)
|-qpidd(2356)-+-{qpidd}(2357)
| |-{qpidd}(2358)
| `-{qpidd}(2359)
|-rpc.idmapd(1864)
|-rpc.mountd(2190)
|-rpc.rquotad(2175)
|-rpc.statd(1818)
|-rpcbind(1648)
|-rsyslogd(1574)-+-{rsyslogd}(1575)
| |-{rsyslogd}(1576)
| `-{rsyslogd}(1578)
|-rtkit-daemon(2661)-+-{rtkit-daemon}(2662)
| `-{rtkit-daemon}(2663)
|-seahorse-agent(3155)
|-seahorse-daemon(4243)
|-sshd(2233)---sshd(5003)---bash(5005)---pstree(5057)
|-sssd(2216)-+-sssd_be(2281)
| |-sssd_nss(2286)
| `-sssd_pam(2287)
|-stap-serverd(1927)---{stap-serverd}(1932)
|-udevd(542)-+-udevd(1166)
| `-udevd(1745)
|-udisks-daemon(4373)---udisks-daemon(4374)
|-wpa_supplicant(1813)
`-xinetd(2241)
________________________________________
From: Christopher Tooley [[email protected]]
Sent: Wednesday, November 28, 2012 1:00 PM
To: David Fitzgerald
Cc: [email protected]
Subject: Re: clients slow down due to unknown process
If/when you find out what it is, would you kindly report back to the list what
you find? This has got me really curious now. :D
-Chris
On 2012-11-28, at 5:51 AM, David Fitzgerald<[email protected]>
wrote:
Thank you everyone for all the good ideas. I have class this evening and will
be able to use your suggestions. I'll let you know what I find.
Dave
-----Original Message-----
From: Robert Blair [mailto:[email protected]]
Sent: Tuesday, November 27, 2012 11:56 AM
To: Sergio Ballestrero
Cc: David Fitzgerald; [email protected]
Subject: Re: clients slow down due to unknown process
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
"/usr/sbin/lsof -p $PID" will also list all of the resources it uses, which is
often a big help in figuring out wtf it is all about.
On 11/27/2012 10:52 AM, Sergio Ballestrero wrote:
Hello David,
I'm not familiar with freeIPA, but anyway you can start by better
identifying the process.
In top, get the PID and look under /proc/$PID. In particular, exe
will be a link to the binary, like
lrwxrwxrwx 1 root root 0 Nov 27 01:41 /proc/1/exe -> /sbin/init
pstree -p -H $PID
will help you identify the parent process, if there's one.
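The same lookup can be sketched in Python, assuming a Linux /proc layout. The function name and the proc_root parameter are made up for illustration. It resolves the exe link when there is one and reads the command name and parent PID from /proc/$PID/stat.

```python
import os


def describe_process(pid, proc_root="/proc"):
    """Return the command name, exe target and parent PID for pid.

    Hypothetical helper; proc_root exists only so the function can be
    pointed at a different directory than /proc.
    """
    base = os.path.join(proc_root, str(pid))
    try:
        exe = os.readlink(os.path.join(base, "exe"))
    except OSError:
        exe = None  # link missing or unreadable
    with open(os.path.join(base, "stat")) as f:
        stat = f.read()
    # The command name sits in parentheses and may contain spaces,
    # so split on the first "(" and the last ")".
    comm = stat[stat.index("(") + 1:stat.rindex(")")]
    rest = stat[stat.rindex(")") + 1:].split()
    # After the command name: rest[0] is the state, rest[1] the parent PID.
    return {"pid": pid, "comm": comm, "exe": exe, "ppid": int(rest[1])}
```

Calling describe_process(5044) on an affected client would report the parent PID even when exe cannot be resolved.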
Cheers,
Sergio
On 27 Nov 2012, at 16:21, David Fitzgerald wrote:
Hello,
Sorry for the length of this post, but I want to make sure I give all
the information needed for someone to help.
I have a lab of 25 workstations running Scientific Linux 6.2. User
accounts are authenticated via freeIPA and automounted from an NFS
server, and the users run Gnome 2.28. The NFS and freeIPA services
run on the same server (IP 10.10.10.10), which is also running
Scientific Linux 6.2 and is a virtual guest in VMware ESXi 4.1.
During class when the workstations are most heavily in use, the
students are writing Fortran programs with gedit and usually have
firefox up as well. Here is my predicament. During class some of
the workstation screens will freeze with no mouse or keyboard input.
This can last for varying lengths of time: sometimes a few minutes,
other times the full length of the class. I can ssh in to
the frozen machines, and top will show load averages of 4 or more.
The process taking up the most CPU is one I don't recognize, named
10.10.10.10-ma, where 10.10.10.10 is the IP address of my server.
I have no idea what that process is related to, whether it's freeIPA,
NFS, Gnome or something else. Killing the process doesn't help as it
simply restarts with a new PID. Note that the freezing does NOT
happen when only a few people are using the lab, so reproducing the
problem outside of class time is difficult.
Can anyone help me track down this problem and fix it?
I appreciate any help you can give.
Thanks!
Dave
+++++++++++++++++++++++
David Fitzgerald
Department of Earth Sciences
Millersville University
Millersville, PA 17551
Phone: 717-871-2394
--
Sergio Ballestrero - http://physics.uj.ac.za/psiwiki/Ballestrero
University of Johannesburg, Physics Department ATLAS TDAQ sysadmin
team - Office:75282 OnCall:164851
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
iQEUAwUBULTwmfQM1KNWz8QaAQLU0Qf2JXa29RVDhJALq2TD72Nis4wAmxlqFIYP
rIo5sHBUI+o/bebsDit9qoC+hWuCK3+xDai9fzF2jUQqXfhRZiPHjdQRpCViMurY
Wp+aVZWCD1U3KusuWMSWlv6Xdx0QmaMQr8Nh8JRRWUi8cNEgAO2Th1txwdu3auJb
LssTFmwUjLUEC0mKhgx6520hisirfOHNTnF3rQCN5ilZGEYEZ2vMm/lcm5yI0Sqc
wdqWUXVYGNsBepFf4bRWaWPX0Hbf6sbLgoJNUHJOJ2pGpc3MUp3SiGsIIUGkZwPW
xT6kS523J+nItY/odmvdl+ibHRVa7TgDx0xhuqISarr39g00yvvx
=RQky
-----END PGP SIGNATURE-----