See below for specific items. Can you run iostat on the servers while writing a file that experiences the slow performance? If you could watch iostat -dmx <device of pvfs storage space> and provide any salient snippets (high utilization, low utilization, odd-looking output, etc.), that could help.
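Something along these lines should do it. The device name below is only a guess; use whatever device actually backs /mnt/pvfs2 on your servers (df /mnt/pvfs2 will show it), and the test file path/size are just examples:

    # On one of the PVFS2 servers: watch the backing device at 2-second
    # intervals while the slow write is in progress.
    iostat -dmx /dev/mapper/pvfs2_vg-pvfs2_lv 2

    # Meanwhile, from a compute node, reproduce the slow write into the
    # mounted file system.
    dd if=/dev/zero of=/mnt/pvfs2/iostat_test bs=1024k count=1000

If %util sits near 100% while the throughput is low, that points at the disks; if %util stays low, it points back toward the network or the servers themselves.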
On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]> wrote:
> 1) iperf (defaults) reported 873, 884, and 929 for connections from the three servers to the head node (a pvfs2 client)

Just to be clear, those are Mbps, right?

> 2) no errors showed up on any of the ports on the managed switch.

Hmm, if those are Mbps then this isn't looking like a network-layer problem.

> 3) I'm not sure what this will do, as the pvfs2 volume is comprised of 3 servers, so mounting it on a server still uses the network for the other two. I also don't understand the "single file per datafile" statement. In any case, I do not have the kernel module compiled on my servers; they ONLY have the pvfs2 server software installed.

A logical file (e.g. foo.out) in a PVFS2 file system is made up of one or more datafiles. Based on your config I would assume most are made up of 3 datafiles with the default stripe size of 64k. You can run pvfs2-viewdist -f <file name> to see what the distribution is and which servers a given file lives on. To see cumulative throughput from multiple PVFS2 servers, the number of datafiles must be greater than one. Check a couple of the problematic files to see what their distribution is.

For a quick test to see if the distribution is impacting performance, set the following extended attribute on a directory and then check the performance of writing a file into it:

    setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>

Also, you can test whether a larger strip_size would help by doing something similar to the following (for a 256k strip):

    setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
    setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>

(I've sketched the full check-and-retest sequence further down, after your config.)

> 4) I'm not sure; I used largely defaults. I've attached my config below.
>
> 5) the network bandwidth is on one of the servers (the one I checked; I believe them to all be similar).
>
> 6) Not sure. I have created an XFS filesystem using LVM to combine the two hardware raid6 volumes and mounted that at /mnt/pvfs2 on the servers. I then let pvfs do its magic. Config files below.
>
> 7) (from second e-mail): Config file attached.
>
> ----------
> /etc/pvfs2-fs.conf:
> ----------
> [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
> <Defaults>
>     UnexpectedRequests 50
>     EventLogging none
>     LogStamp datetime
>     BMIModules bmi_tcp
>     FlowModules flowproto_multiqueue
>     PerfUpdateInterval 1000
>     ServerJobBMITimeoutSecs 30
>     ServerJobFlowTimeoutSecs 30
>     ClientJobBMITimeoutSecs 300
>     ClientJobFlowTimeoutSecs 300
>     ClientRetryLimit 5
>     ClientRetryDelayMilliSecs 2000
>     StorageSpace /mnt/pvfs2
>     LogFile /var/log/pvfs2-server.log
> </Defaults>
>
> <Aliases>
>     Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
>     Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
>     Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
> </Aliases>
>
> <Filesystem>
>     Name pvfs2-fs
>     ID 62659950
>     RootHandle 1048576
>     <MetaHandleRanges>
>         Range pvfs2-io-0-0 4-715827885
>         Range pvfs2-io-0-1 715827886-1431655767
>         Range pvfs2-io-0-2 1431655768-2147483649
>     </MetaHandleRanges>
>     <DataHandleRanges>
>         Range pvfs2-io-0-0 2147483650-2863311531
>         Range pvfs2-io-0-1 2863311532-3579139413
>         Range pvfs2-io-0-2 3579139414-4294967295
>     </DataHandleRanges>
>     <StorageHints>
>         TroveSyncMeta yes
>         TroveSyncData no
>     </StorageHints>
> </Filesystem>
>
> ---------------------
> /etc/pvfs2-server.conf-pvfs2-io-0-2
> ---------------------
> StorageSpace /mnt/pvfs2
> HostID "tcp://pvfs2-io-0-2:3334"
> LogFile /var/log/pvfs2-server.log
> ---------------------
>
> All the server config files are very similar.
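For what it's worth, the whole check-and-retest sequence could look something like the following, run from a client with the file system mounted. The directory and file names are only examples and the dd size is arbitrary; the setfattr and pvfs2-viewdist invocations are the ones described above.

    # See how one of the currently slow files is distributed (how many
    # datafiles, which servers):
    pvfs2-viewdist -f /mnt/pvfs2/some_slow_output_file

    # Make a test directory and force 3 datafiles with a 256k strip;
    # files subsequently created in it pick up these settings:
    mkdir /mnt/pvfs2/stripe_test
    setfattr -n user.pvfs2.num_dfiles -v "3" /mnt/pvfs2/stripe_test
    setfattr -n user.pvfs2.dist_name -v simple_stripe /mnt/pvfs2/stripe_test
    setfattr -n user.pvfs2.dist_params -v strip_size:262144 /mnt/pvfs2/stripe_test

    # Write into it, compare the rate against your earlier numbers, and
    # confirm the new file's distribution:
    dd if=/dev/zero of=/mnt/pvfs2/stripe_test/testfile bs=1024k count=1000
    pvfs2-viewdist -f /mnt/pvfs2/stripe_test/testfile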
>
> --Jim
>
>
> On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore <[email protected]> wrote:
> > No doubt something is awry. Offhand I'm suspecting the network. A couple of things that might help give a direction:
> > 1) Do an end-to-end TCP test between client/server. Something like iperf or nuttcp should do the trick.
> > 2) Check server and client ethernet ports on the switch for high error counts (not familiar with that switch, not sure if it's managed or not). Hardware (port/cable) errors should show up in the above test.
> > 3) Can you mount the PVFS2 file system on the server and run some I/O tests (single datafile per file) to see if the network is in fact in play?
> > 4) What is the number of datafiles (by default) that each file you're writing is using? 3?
> > 5) When you watch network bandwidth and see 10 MB/s, where is that? On the server?
> > 6) What backend are you using for I/O, direct or alt-aio? Nothing really wrong either way, just wondering.
> >
> > It sounds like, based on the dd output, the disks are capable of more than you're seeing; we just need to narrow down where the performance is getting squelched.
> >
> > Michael
> >
> > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
> >>
> >> Hi all:
> >>
> >> I've got a pvfs2 install on my cluster. I never felt it was performing up to snuff, but lately it seems that things have gone way, way down in total throughput and overall usability, to the point that jobs writing out 900MB take an extra 1-2 hours to complete due to disk I/O waits. A 2-hr job that writes about 30GB over the course of the run (normally about 2hrs long) takes up to 20hrs. Once the disk I/O is cut out, it completes in 1.5-2hrs. I've noticed personally that there's up to a 5 sec lag when I cd into /mnt/pvfs2 and do an ls. Note that all of our operations use the kernel module / mount point. Our problems and code base do not support the use of other tools (such as the pvfs2-* utilities or the native MPI libraries); it's all done through the kernel module / filesystem mountpoint.
> >>
> >> My configuration is this: 3 pvfs2 servers (Dell PowerEdge 1950's with 1.6GHz quad-core CPUs, 4GB RAM, raid-0 for metadata+OS on a Perc5i card), and a Dell Perc6e card with hardware raid6 in two volumes: one on a bunch of 750GB SATA drives, and the other on its second SAS connector to about 12 2TB WD drives. The two raid volumes are LVM'ed together in the OS and mounted as the pvfs2 data store. Each server is connected via ethernet to a stack of LG-Ericsson gig-e switches (stack == 2 switches with 40Gbit stacking cables installed). PVFS 2.8.2 is used throughout the cluster on Rocks (using site-compiled pvfs, not the Rocks-supplied pvfs). OSes are CentOS 5.x-based (both clients and servers).
> >>
> >> As I said, I always felt something wasn't quite right, but a few months back I performed a series of upgrades and reconfigurations on the infrastructure and hardware. Specifically, I upgraded to the LG-Ericsson switches, replaced a full 12-bay drive shelf with a 24-bay one (moving all the disks through), and added some additional disks. All three pvfs2 servers are identical in this. At some point prior to these changes, my users were able to get acceptable performance from pvfs2; now they are not. I don't have any evidence pointing to the switch or to the disks.
> >>
> >> I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get 380+MB/s locally on the pvfs server, writing to the partition on the hardware raid6 card. From a compute node, doing that for a 100MB file, I get 47.7MB/s to my RAID-5 NFS server on the head node, and 36.5MB/s to my pvfs2 mounted share. When I watch the network bandwidth/throughput using bwm-ng, I rarely see more than 10MB/s, and often it's around 4MB/s with a 12-node IO-bound job running.
> >>
> >> I originally had the pvfs2 servers connected to the switch with dual gig-e connections using bonding (ALB) to make them better able to serve multiple nodes. I never saw anywhere close to the throughput I should have. In any case, to test whether that was the problem, I removed the bonding and am running through a single gig-e pipe now, but performance hasn't improved at all.
> >>
> >> I'm not sure how to troubleshoot this problem further. Presently, the cluster isn't usable for large I/O jobs, so I really have to fix this.
> >>
> >> --Jim
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
