Hi all:

I've got a PVFS2 install on my cluster.  I never felt it was
performing up to snuff, but lately total throughput and overall
usability have gone way, way down.  Jobs writing out 900MB now take an
extra 1-2 hours to complete due to disk I/O waits, and a job that
normally runs about 2 hours while writing roughly 30GB over the course
of the run now takes up to 20 hours; with the disk I/O cut out, it
completes in 1.5-2 hours.  I've also noticed up to a 5-second lag when
I cd into /mnt/pvfs2 and do an ls.  Note that all of our operations go
through the kernel module / mount point.  Our problems and code base
do not support the use of other tools (such as the pvfs2-* utilities
or the native MPI libraries); it's all done through the kernel
module / filesystem mountpoint.

My configuration is this: 3 PVFS2 servers (Dell PowerEdge 1950s with
1.6GHz quad-core CPUs, 4GB RAM, RAID-0 for metadata+OS on a PERC 5/i
card), each with a Dell PERC 6/E card doing hardware RAID-6 in two
volumes: one on a bunch of 750GB SATA drives, and the other, on the
card's second SAS connector, on about 12 2TB WD drives.  The two RAID
volumes are LVM'ed together in the OS and mounted as the PVFS2 data
store.  Each server is connected via Ethernet to a stack of
LG-Ericsson gig-e switches (stack == 2 switches with 40Gbit stacking
cables installed).  PVFS 2.8.2 is used throughout the cluster on Rocks
(using site-compiled PVFS, not the Rocks-supplied PVFS).  OSes are
CentOS 5.x-based (both clients and servers).
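
Roughly, the LVM layout on each server looks like the following
(device names, the volume group name, and the ext3 choice are just
illustrative here, not the exact values from my systems):

    # each /dev/sdX is one hardware RAID-6 volume from the PERC 6/E
    pvcreate /dev/sdb /dev/sdc
    vgcreate pvfs2vg /dev/sdb /dev/sdc
    lvcreate -l 100%FREE -n pvfs2lv pvfs2vg
    mkfs.ext3 /dev/pvfs2vg/pvfs2lv
    # mounted where the pvfs2 server keeps its data store
    mount /dev/pvfs2vg/pvfs2lv /pvfs2-storage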

As I said, I always felt something wasn't quite right, but a few
months back I performed a series of upgrades and reconfigurations on
the infrastructure and hardware.  Specifically, I upgraded to the
LG-Ericsson switches, replaced a full 12-bay drive shelf with a 24-bay
one (moving all the disks through), and added some additional disks.
All three PVFS2 servers are identical in this respect.  At some point
prior to these changes, my users were able to get acceptable
performance from PVFS2; now they are not.  I don't have any evidence
pointing to the switch or to the disks.

I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get
380+MB/s locally on the pvfs2 server, writing to the partition on the
hardware RAID-6 card.  From a compute node, doing the same with a
100MB file, I get 47.7MB/s to the RAID-5 NFS server on the head node
and 36.5MB/s to the pvfs2-mounted share.  When I watch the network
bandwidth/throughput using bwm-ng, I rarely see more than 10MB/s, and
often it's around 4MB/s with a 12-node I/O-bound job running.
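
In case the 100MB runs are partly hitting the client page cache, the
tests I've been running from a compute node look roughly like this
(paths and sizes are just examples):

    # conv=fdatasync flushes the data before dd reports a rate,
    # so the page cache doesn't inflate the number
    dd if=/dev/zero of=/mnt/pvfs2/testfile bs=1M count=1024 conv=fdatasync
    # same idea with O_DIRECT, if the pvfs2 kernel module accepts it
    dd if=/dev/zero of=/mnt/pvfs2/testfile bs=1M count=1024 oflag=direct
    # meanwhile, on a server, watch per-disk utilization
    iostat -x 5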

I originally had the pvfs2 servers connected to the switch with dual
gig-e connections and bonding (ALB) so that each server could better
serve multiple nodes.  I never saw anywhere close to the throughput I
should have.  In any case, to test whether that was the problem, I
removed the bonding and am running through a single gig-e pipe now,
but performance hasn't improved at all.
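
For completeness, this is the sort of thing I've been checking on the
server NICs and the old bond (interface names will vary; these are the
generic ones):

    # negotiated speed/duplex on the server NIC
    ethtool eth0
    # error and drop counters on the interface
    ip -s link show eth0
    # while the bond was still configured, mode and slave status
    cat /proc/net/bonding/bond0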

I'm not sure how to troubleshoot this problem further.  Presently, the
cluster isn't usable for large I/O jobs, so I really have to fix this.
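
If it helps anyone suggest next steps: although the jobs themselves
only go through the mountpoint, I can still run the admin utilities by
hand for diagnostics, e.g. (quoting the syntax from memory):

    # check that all pvfs2 servers respond and the fs is sane
    pvfs2-ping -m /mnt/pvfs2
    # per-server capacity / status
    pvfs2-statfs -m /mnt/pvfs2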

--Jim
