Hi all:

I'm (once again) experiencing system instability that appears to be
traceable to pvfs2.  Symptoms usually show up when one or more users
start long SCP sessions, transferring 5+GB of data over several
hours.  I believe they usually have 1-3 sessions running in parallel.
Symptoms include:

* High load averages (climbing slowly with continued use) without
corresponding CPU load.  The ONLY way to recover from this is to reboot.
My load average is currently 7.10 with 99.8% idle CPU.
* Hung SCP and other I/O processes.
* Large amounts of RAM "missing" (currently free -m reports 7552MB in
use; adding up usage from all processes comes to about 1GB).
* Often (always?) some users' files become inaccessible (although
users have stopped reporting those problems, as it has happened so
frequently).
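For anyone who wants to check the first and third symptoms on their own head node: on Linux, the load average counts tasks in uninterruptible sleep (state D, usually blocked on I/O) as well as runnable ones, so a high load with an idle CPU typically points at processes stuck waiting on I/O. Likewise, kernel memory such as buffers, page cache and slab is not charged to any process, so summing per-process usage will not account for it. The sketch below (Python 3; the function names are hypothetical, not part of PVFS or any existing tool) reads /proc to list D-state processes and to split "used" memory into cache, slab and the remainder. It is a minimal diagnostic under those assumptions, not a fix.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the LAST ')' instead of on whitespace.
    """
    pid_part, rest = line.split("(", 1)
    comm, after = rest.rsplit(")", 1)
    state = after.split()[0]
    return int(pid_part), comm, state


def blocked_tasks(proc="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep ('D').

    Only thread-group leaders are scanned; per-thread states live
    under /proc/<pid>/task/ and are not examined here.
    """
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited (or is unreadable) mid-scan
        if state == "D":
            found.append((pid, comm))
    return found


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value} (values in kB)."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields = value.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Break 'used' memory into cache and the rest, in MB."""
    used = info["MemTotal"] - info["MemFree"]
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "used_mb": used // 1024,
        "buffers_cache_mb": cache // 1024,
        "used_minus_cache_mb": (used - cache) // 1024,
        # Slab is kernel-internal memory that no process is charged for.
        "slab_mb": info.get("Slab", 0) // 1024,
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    for pid, comm in blocked_tasks():
        print("D-state: pid=%d comm=%s" % (pid, comm))
    with open("/proc/meminfo") as f:
        print(memory_summary(parse_meminfo(f.read())))
```

Run it on the head node while the symptoms are present; the output is only a starting point for narrowing down where the I/O is blocked.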

If I let this go a bit longer, there's a reasonable chance that the
machine will just spontaneously reboot.  Nothing is logged as to
the cause: no OOM or other errors.  One minute everything's
fine; the next it's booting up.

Sometimes it takes a long time for these problems to build up (for
example, right now the system load and memory issues are present after
a couple of days of "building"); other times the system will
spontaneously reboot several times in one day (with no warning sign of
climbing loads or the like).

These problems so far have only happened on the head node (pvfs
client); our compute nodes have not shown this problem.

System configuration:
* Rocks 5.1 with a manual PVFS setup (NOT using the Rocks-supplied PVFS
binaries or configurations)
* pvfs 2.7.1 + patches from pcarns
* 3 dedicated CentOS 5 PVFS servers (each with ~10TB of storage, Dell
PERC 6/e + MD1000s)
* PVFS servers connected over bonded dual-gigabit links (Linux kernel
ethernet bonding driver)
* Clients are single-gigabit connected
* No off-site pvfs2 access (scp/ssh/sftp access only, via the head node)

Any suggestions?
I'm getting fairly desperate for help: pvfs2 has been the main
destabilizing factor for the cluster since it went online, and
spontaneous reboots are not a good thing....

Thanks!
--Jim
_______________________________________________
Pvfs2-users mailing list
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users