All speeds were in Mbps, the default from iperf. Our files are multi-GB in size, so they do involve all three servers. It applies to all files on the system.
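(For reference, a minimal sketch of the end-to-end TCP test being discussed here, assuming iperf 2.x with defaults; the server hostname is taken from the config further down, and the client side runs on the head node:

  # on one of the PVFS2 servers, e.g. pvfs2-io-0-0:
  iperf -s

  # on the head node (a PVFS2 client):
  iperf -c pvfs2-io-0-0      # defaults report throughput in Mbits/sec

A healthy, un-bonded gig-e path should land somewhere in the 870-940 Mbits/sec range, which matches the numbers reported below.)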
Can I change the stripe size "on the go"? I already have about 50TB of data in the system, and have no place big enough to back it up to rebuild the pvfs2 array and restore....

--Jim

On Fri, Sep 30, 2011 at 1:46 PM, Michael Moore <[email protected]> wrote:
> See below for specific items. Can you run iostat on the servers while writing a file that experiences the slow performance? If you could watch iostat -dmx <device of pvfs storage space> and provide any salient snippets (high utilization, low utilization, odd-looking output, etc.) that could help.
>
> On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]> wrote:
>>
>> 1) iperf (defaults) reported 873, 884, and 929 for connections from the three servers to the head node (a pvfs2 client)
>
> Just to be clear, those are Mbps, right?
>
>>
>> 2) no errors showed up on any of the ports on the managed switch.
>
> Hmm, if those are Mbps, this doesn't look like a network-layer problem.
>>
>> 3) I'm not sure what this will do, as the pvfs2 volume is comprised of 3 servers, so mounting it on a server still uses the network for the other two. I also don't understand the "single file per datafile" statement. In any case, I do not have the kernel module compiled on my servers; they ONLY have the pvfs2 server software installed.
>>
>
> A logical file (e.g. foo.out) in a PVFS2 file system is made up of one or more datafiles. Based on your config I would assume most are made up of 3 datafiles with the default stripe size of 64k.
>
> You can run pvfs2-viewdist -f <file name> to see what the distribution is and which servers a given file lives on. To see cumulative throughput from multiple PVFS2 servers, the number of datafiles must be greater than one. Check a couple of the problematic files to see what their distribution is.
>
> For a quick test to see if the distribution is impacting performance, set the following extended attribute on a directory and then check the performance of writing a file into it:
> setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>
>
> Also, you can test whether a larger strip_size would help by doing something similar to the following (for a 256k strip):
> setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
> setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>
>
>>
>> 4) I'm not sure; I used largely defaults. I've attached my config below.
>>
>> 5) the network bandwidth is on one of the servers (the one I checked; I believe them all to be similar).
>>
>> 6) Not sure. I have created an XFS filesystem using LVM to combine the two hardware raid6 volumes and mounted that at /mnt/pvfs2 on the servers. I then let pvfs do its magic. Config files below.
>>
>> 7 (from second e-mail): Config file attached.
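(As an aside, a minimal sketch of how the distribution test above could be run end to end from a client, assuming the kernel mount point is /mnt/pvfs2 and using a hypothetical scratch directory; as far as I know these attributes only affect files created in the directory afterwards, so existing data is left untouched:

  # example path; any freshly created directory on the mount will do
  mkdir /mnt/pvfs2/stripe-test

  # request 3 datafiles and a 256k strip for files subsequently created here
  setfattr -n user.pvfs2.num_dfiles -v "3" /mnt/pvfs2/stripe-test
  setfattr -n user.pvfs2.dist_name -v simple_stripe /mnt/pvfs2/stripe-test
  setfattr -n user.pvfs2.dist_params -v strip_size:262144 /mnt/pvfs2/stripe-test

  # write a large file through the kernel mount and note the rate dd reports
  dd if=/dev/zero of=/mnt/pvfs2/stripe-test/testfile bs=1024k count=1000

  # confirm how the new file was actually distributed across the three servers
  pvfs2-viewdist -f /mnt/pvfs2/stripe-test/testfile

Comparing the dd rate in this directory against one using the default 64k strip should show whether the distribution settings are part of the bottleneck.)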
>>
>> ----------
>> /etc/pvfs2-fs.conf:
>> ----------
>> [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
>> <Defaults>
>> UnexpectedRequests 50
>> EventLogging none
>> LogStamp datetime
>> BMIModules bmi_tcp
>> FlowModules flowproto_multiqueue
>> PerfUpdateInterval 1000
>> ServerJobBMITimeoutSecs 30
>> ServerJobFlowTimeoutSecs 30
>> ClientJobBMITimeoutSecs 300
>> ClientJobFlowTimeoutSecs 300
>> ClientRetryLimit 5
>> ClientRetryDelayMilliSecs 2000
>> StorageSpace /mnt/pvfs2
>> LogFile /var/log/pvfs2-server.log
>> </Defaults>
>>
>> <Aliases>
>> Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
>> Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
>> Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
>> </Aliases>
>>
>> <Filesystem>
>> Name pvfs2-fs
>> ID 62659950
>> RootHandle 1048576
>> <MetaHandleRanges>
>> Range pvfs2-io-0-0 4-715827885
>> Range pvfs2-io-0-1 715827886-1431655767
>> Range pvfs2-io-0-2 1431655768-2147483649
>> </MetaHandleRanges>
>> <DataHandleRanges>
>> Range pvfs2-io-0-0 2147483650-2863311531
>> Range pvfs2-io-0-1 2863311532-3579139413
>> Range pvfs2-io-0-2 3579139414-4294967295
>> </DataHandleRanges>
>> <StorageHints>
>> TroveSyncMeta yes
>> TroveSyncData no
>> </StorageHints>
>> </Filesystem>
>>
>>
>> ---------------------
>> /etc/pvfs2-server.conf-pvfs2-io-0-2
>> ---------------------
>> StorageSpace /mnt/pvfs2
>> HostID "tcp://pvfs2-io-0-2:3334"
>> LogFile /var/log/pvfs2-server.log
>> ---------------------
>>
>> All the server config files are very similar.
>>
>> --Jim
>>
>>
>> On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore <[email protected]> wrote:
>> > No doubt something is awry. Offhand I'm suspecting the network. A couple of things that might help give a direction:
>> > 1) Do an end-to-end TCP test between client and server. Something like iperf or nuttcp should do the trick.
>> > 2) Check the server and client ethernet ports on the switch for high error counts (not familiar with that switch, not sure if it's managed or not). Hardware (port/cable) errors should show up in the above test.
>> > 3) Can you mount the PVFS2 file system on the server and run some I/O tests (single datafile per file) to see if the network is in fact in play?
>> > 4) How many datafiles (by default) is each file you're writing to using? 3?
>> > 5) When you watch network bandwidth and see 10 MB/s, where is that? On the server?
>> > 6) What backend are you using for I/O, direct or alt-aio? Nothing really wrong either way, just wondering.
>> >
>> > It sounds like, based on the dd output, the disks are capable of more than you're seeing; we just need to narrow down where the performance is getting squelched.
>> >
>> > Michael
>> >
>> > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
>> >>
>> >> Hi all:
>> >>
>> >> I've got a pvfs2 install on my cluster. I never felt it was performing up to snuff, but lately it seems that things have gone way, way down in total throughput and overall usability, to the point that jobs writing out 900MB take an extra 1-2 hours to complete due to disk I/O waits. A 2-hr job that writes about 30GB over the course of the run (normally about 2hrs long) takes up to 20hrs; once the disk I/O is cut out, it completes in 1.5-2hrs. I've noticed personally that there's up to a 5 sec lag time when I cd into /mnt/pvfs2 and do an ls. Note that all of our operations are using the kernel module / mount point.
>> >> Our problems and code base do not support the use of other tools (such as the pvfs2-* tools or the native MPI libraries); it's all done through the kernel module / filesystem mountpoint.
>> >>
>> >> My configuration is this: 3 pvfs2 servers (Dell PowerEdge 1950's with 1.6GHz quad-core CPUs, 4GB RAM, and raid-0 for metadata+OS on a perc5i card), each with a Dell Perc6e card with hardware raid6 in two volumes: one on a bunch of 750GB SATA drives, and the other on its second SAS connector to about 12 2TB WD drives. The two raid volumes are lvm'ed together in the OS and mounted as the pvfs2 data store. Each server is connected via ethernet to a stack of LG-Ericsson gig-e switches (stack==2 switches with 40Gbit stacking cables installed). PVFS 2.8.2 is used throughout the cluster on Rocks (using site-compiled pvfs, not the rocks-supplied pvfs). OSes are CentOS5-x-based (both clients and servers).
>> >>
>> >> As I said, I always felt something wasn't quite right, but a few months back I performed a series of upgrades and reconfigurations on the infrastructure and hardware. Specifically, I upgraded to the LG-Ericsson switches and replaced a full 12-bay drive shelf with a 24-bay one (moving all the disks through) while adding some additional disks. All three pvfs2 servers are identical in this. At some point prior to these changes, my users were able to get acceptable performance from pvfs2; now they are not. I don't have any evidence pointing to the switch or to the disks.
>> >>
>> >> I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get 380+MB/s locally on the pvfs server, writing to the partition on the hardware raid6 card. From a compute node, doing that for a 100MB file, I get 47.7MB/s to my RAID-5 NFS server on the head node, and 36.5MB/s to my pvfs2 mounted share. When I watch the network bandwidth/throughput using bwm-ng, I rarely see more than 10MB/s, and often it's around 4MB/s with a 12-node IO-bound job running.
>> >>
>> >> I originally had the pvfs2 servers connected to the switch with dual gig-e connections and using bonding (ALB) to make them better able to serve multiple nodes. I never saw anywhere close to the throughput I should have. In any case, to test if that was the problem, I removed the bonding and am running through a single gig-e pipe now, but performance hasn't improved at all.
>> >>
>> >> I'm not sure how to troubleshoot this problem further. Presently, the cluster isn't usable for large I/O jobs, so I really have to fix this.
>> >>
>> >> --Jim
>> >> _______________________________________________
>> >> Pvfs2-users mailing list
>> >> [email protected]
>> >> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>> >
>> >
>
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
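(Circling back to the iostat request at the top of the thread: a minimal sketch of what could be captured on each server while one of the slow writes described above is running. The device path is a placeholder; with the two raid6 volumes LVM'd together it will be a /dev/mapper device, which df against the storage space should reveal. The client-side path assumes the compute nodes also mount the file system at /mnt/pvfs2:

  # on the server: find the device backing the PVFS2 storage space
  df /mnt/pvfs2

  # on the server: watch extended device stats for it every 2 seconds
  iostat -dmx /dev/mapper/<vg-lv> 2

  # on a compute node, at the same time: reproduce the slow write through the mount
  dd if=/dev/zero of=/mnt/pvfs2/testfile bs=1024k count=10000

If %util sits near 100% with high await during the client write, the servers' disks are the limit; if %util stays low while the client still only sees a few MB/s, the bottleneck is more likely between the client and the storage, i.e. the network, PVFS2 request handling, or the kernel client.)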
