On Mon, Oct 3, 2011 at 2:38 PM, Jim Kusznir <[email protected]> wrote:
> All speeds were in Mbps, the default from iperf.
>
> Our files are multi-GB in size, so they do involve all three servers.
> It applies to all files on the system.

Okay, good, wanted to confirm.

> Can I change the stripe size "on the go"? I already have about 50TB
> of data in the system, and have no place big enough to back it up to
> rebuild the pvfs2 array and restore....

Unfortunately, not that I know of. You can set the extended attributes,
mentioned previously, on all directories so new files will use a
different stripe size. Ideally, the strip per server will be equal to
some unit that your underlying file system and storage digest
efficiently, like a RAID stripe.

Did a larger stripe improve your observed throughput?

Michael

> --Jim
>
> On Fri, Sep 30, 2011 at 1:46 PM, Michael Moore <[email protected]> wrote:
> > See below for specific items. Can you run iostat on the servers while
> > writing a file that experiences the slow performance? If you could watch
> > iostat -dmx <device of pvfs storage space> and provide any salient
> > snippets (high utilization, low utilization, odd-looking output, etc.)
> > that could help.
> >
> > On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]> wrote:
> >>
> >> 1) iperf (defaults) reported 873, 884, and 929 for connections from
> >> the three servers to the head node (a pvfs2 client)
> >
> > Just to be clear, those are Mbps, right?
> >
> >> 2) no errors showed up on any of the ports on the managed switch.
> >
> > Hmm, if those are Mbps it doesn't seem to be a network-layer problem.
> >
> >> 3) I'm not sure what this will do, as the pvfs2 volume is comprised of
> >> 3 servers, so mounting it on a server still uses the network for the
> >> other two. I also don't understand the "single datafile per file"
> >> statement. In any case, I do not have the kernel module compiled on
> >> my servers; they ONLY have the pvfs2 server software installed.
> >
> > A logical file (e.g. foo.out) in a PVFS2 file system is made up of one
> > or more datafiles. Based on your config I would assume most are made up
> > of 3 datafiles with the default stripe size of 64k.
> >
> > You can run pvfs2-viewdist -f <file name> to see what the distribution
> > is and what servers a given file lives on. To see cumulative throughput
> > from multiple PVFS2 servers the number of datafiles must be greater
> > than one. Check a couple of the problematic files to see what their
> > distribution is.
> >
> > For a quick test to see if the distribution is impacting performance,
> > set the following extended attribute on a directory and then check the
> > performance of writing a file into it:
> > setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>
> >
> > Also, you can test whether a larger strip_size would help by doing
> > something similar to the following (for a 256k strip):
> > setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
> > setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>
> >
> >> 4) I'm not sure; I used largely defaults. I've attached my config below.
> >>
> >> 5) The network bandwidth is on one of the servers (the one I checked;
> >> I believe them all to be similar).
> >>
> >> 6) Not sure. I have created an XFS filesystem using LVM to combine
> >> the two hardware raid6 volumes and mounted that at /mnt/pvfs2 on the
> >> servers. I then let pvfs do its magic. Config files below.
> >>
> >> 7) (from second e-mail): Config file attached.
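Since the attributes only affect files created after they are set, existing directories each need them applied. A minimal sketch of how one might script that, using the attribute names and the 256k strip size from the examples above (the find/print-then-review approach is my own suggestion, not from the thread; the function only prints the commands so they can be inspected before piping to sh):

```shell
# Sketch: emit the setfattr commands from the thread for every directory
# under a PVFS2 mount point, so newly created files pick up the wider
# distribution. Prints the commands rather than running them; review the
# output, then pipe it to sh to apply.
apply_pvfs2_attrs() {
    mount_point="$1"
    find "$mount_point" -type d | while read -r dir; do
        echo "setfattr -n user.pvfs2.num_dfiles -v 3 \"$dir\""
        echo "setfattr -n user.pvfs2.dist_name -v simple_stripe \"$dir\""
        echo "setfattr -n user.pvfs2.dist_params -v strip_size:262144 \"$dir\""
    done
}
```

For example: `apply_pvfs2_attrs /mnt/pvfs2 | sh` (run as a user with permission to set attributes on all directories).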
> >>
> >> ----------
> >> /etc/pvfs2-fs.conf:
> >> ----------
> >> [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
> >> <Defaults>
> >>     UnexpectedRequests 50
> >>     EventLogging none
> >>     LogStamp datetime
> >>     BMIModules bmi_tcp
> >>     FlowModules flowproto_multiqueue
> >>     PerfUpdateInterval 1000
> >>     ServerJobBMITimeoutSecs 30
> >>     ServerJobFlowTimeoutSecs 30
> >>     ClientJobBMITimeoutSecs 300
> >>     ClientJobFlowTimeoutSecs 300
> >>     ClientRetryLimit 5
> >>     ClientRetryDelayMilliSecs 2000
> >>     StorageSpace /mnt/pvfs2
> >>     LogFile /var/log/pvfs2-server.log
> >> </Defaults>
> >>
> >> <Aliases>
> >>     Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
> >>     Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
> >>     Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
> >> </Aliases>
> >>
> >> <Filesystem>
> >>     Name pvfs2-fs
> >>     ID 62659950
> >>     RootHandle 1048576
> >>     <MetaHandleRanges>
> >>         Range pvfs2-io-0-0 4-715827885
> >>         Range pvfs2-io-0-1 715827886-1431655767
> >>         Range pvfs2-io-0-2 1431655768-2147483649
> >>     </MetaHandleRanges>
> >>     <DataHandleRanges>
> >>         Range pvfs2-io-0-0 2147483650-2863311531
> >>         Range pvfs2-io-0-1 2863311532-3579139413
> >>         Range pvfs2-io-0-2 3579139414-4294967295
> >>     </DataHandleRanges>
> >>     <StorageHints>
> >>         TroveSyncMeta yes
> >>         TroveSyncData no
> >>     </StorageHints>
> >> </Filesystem>
> >>
> >> ---------------------
> >> /etc/pvfs2-server.conf-pvfs2-io-0-2
> >> ---------------------
> >> StorageSpace /mnt/pvfs2
> >> HostID "tcp://pvfs2-io-0-2:3334"
> >> LogFile /var/log/pvfs2-server.log
> >> ---------------------
> >>
> >> All the server config files are very similar.
> >>
> >> --Jim
> >>
> >> On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore <[email protected]> wrote:
> >> > No doubt something is awry. Offhand I'm suspecting the network. A couple
> >> > things that might help give a direction:
> >> > 1) Do an end-to-end TCP test between client/server. Something like
> >> > iperf or nuttcp should do the trick.
> >> > 2) Check server and client ethernet ports on the switch for high error
> >> > counts (not familiar with that switch, not sure if it's managed or not).
> >> > Hardware (port/cable) errors should show up in the above test.
> >> > 3) Can you mount the PVFS2 file system on the server and run some I/O
> >> > tests (single datafile per file) to see if the network is in fact in
> >> > play?
> >> > 4) How many datafiles (by default) is each file you're writing using? 3?
> >> > 5) When you watch network bandwidth and see 10 MB/s, where is that? On
> >> > the server?
> >> > 6) What backend are you using for I/O, direct or alt-aio? Nothing really
> >> > wrong either way, just wondering.
> >> >
> >> > It sounds like, based on the dd output, the disks are capable of more
> >> > than you're seeing; we just need to narrow down where the performance
> >> > is getting squelched.
> >> >
> >> > Michael
> >> >
> >> > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
> >> >>
> >> >> Hi all:
> >> >>
> >> >> I've got a pvfs2 install on my cluster. I never felt it was
> >> >> performing up to snuff, but lately it seems that things have gone way,
> >> >> way down in total throughput and overall usability, to the point that
> >> >> jobs writing out 900MB will take an extra 1-2 hours to complete due to
> >> >> disk I/O waits. A 2-hr job that would write about 30GB over the
> >> >> course of the run (normally about 2hrs long) takes up to 20hrs. Once
> >> >> the disk I/O is cut out, it completes in 1.5-2hrs. I've noticed
> >> >> personally that there's up to a 5-sec lag time when I cd into
> >> >> /mnt/pvfs2 and do an ls. Note that all of our operations are using
> >> >> the kernel module / mount point. Our problems and code base do not
> >> >> support the use of other tools (such as the pvfs2-* utilities or the
> >> >> native MPI libraries); it's all done through the kernel module /
> >> >> filesystem mountpoint.
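The end-to-end TCP test in item 1 above is a per-server exercise with three servers involved; a small helper can emit one iperf test per server, using the host aliases from the config in this thread (the helper itself is my sketch, and it assumes `iperf -s` is already running on each server):

```shell
# Sketch: print the end-to-end iperf TCP tests to run from a client
# against each PVFS2 server. Hostnames come from the Aliases section of
# the pvfs2-fs.conf in this thread. Review the output, then pipe it to
# sh to execute; iperf reports Mbits/sec by default.
print_iperf_tests() {
    for host in pvfs2-io-0-0 pvfs2-io-0-1 pvfs2-io-0-2; do
        echo "iperf -c $host -t 10"   # 10-second TCP test against $host
    done
}
```

For example: `print_iperf_tests | sh`, run on the head node or a compute node.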
> >> >> My configuration is this: 3 pvfs2 servers (Dell PowerEdge 1950's with
> >> >> 1.6GHz quad-core CPUs, 4GB RAM, raid-0 for metadata+os on a perc5i
> >> >> card), and a Dell Perc6e card with hardware raid6 in two volumes: one
> >> >> on a bunch of 750GB sata drives, and the other on its second SAS
> >> >> connector to about 12 2TB WD drives. The two raid volumes are lvm'ed
> >> >> together in the OS and mounted as the pvfs2 data store. Each server is
> >> >> connected via ethernet to a stack of LG-Ericsson gig-e switches
> >> >> (stack == 2 switches with 40Gbit stacking cables installed). PVFS 2.8.2
> >> >> is used throughout the cluster on Rocks (using site-compiled pvfs, not
> >> >> the Rocks-supplied pvfs). OSes are CentOS5-x-based (both clients and
> >> >> servers).
> >> >>
> >> >> As I said, I always felt something wasn't quite right, but a few
> >> >> months back I performed a series of upgrades and reconfigurations on
> >> >> the infrastructure and hardware. Specifically, I upgraded to the
> >> >> LG-Ericsson switches and replaced a full 12-bay drive shelf with a
> >> >> 24-bay one (moving all the disks through), adding some additional
> >> >> disks. All three pvfs2 servers are identical in this. At some point
> >> >> prior to these changes, my users were able to get acceptable
> >> >> performance from pvfs2; now they are not. I don't have any evidence
> >> >> pointing to the switch or to the disks.
> >> >>
> >> >> I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get
> >> >> 380+MB/s locally on the pvfs server, writing to the partition on the
> >> >> hardware raid6 card. From a compute node, doing that for a 100MB file,
> >> >> I get 47.7MB/s to my RAID-5 NFS server on the head node, and 36.5MB/s
> >> >> to my pvfs2-mounted share. When I watch the network
> >> >> bandwidth/throughput using bwm-ng, I rarely see more than 10MB/s, and
> >> >> often it's around 4MB/s with a 12-node IO-bound job running.
> >> >> I originally had the pvfs2 servers connected to the switch with dual
> >> >> gig-e connections and using bonding (ALB) to make them better able to
> >> >> serve multiple nodes. I never saw anywhere close to the throughput I
> >> >> should. In any case, to test if that was the problem, I removed the
> >> >> bonding and am running through a single gig-e pipe now, but
> >> >> performance hasn't improved at all.
> >> >>
> >> >> I'm not sure how to troubleshoot this problem further. Presently, the
> >> >> cluster isn't usable for large I/O jobs, so I really have to fix this.
> >> >>
> >> >> --Jim
> >> >> _______________________________________________
> >> >> Pvfs2-users mailing list
> >> >> [email protected]
> >> >> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
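Jim's dd comparison above can be wrapped into a small helper that reports whole MB/s for a streaming write through any path, making it easy to run the same test against local disk, the NFS share, and the PVFS2 mount side by side. This is a rough sketch of my own (the clamp to 1 second and the use of conv=fsync are my choices; dd's own summary line gives more precise figures):

```shell
# Sketch: time a streaming write and report approximate MB/s, so one
# test can be pointed at local disk, an NFS mount, and the PVFS2 mount
# for comparison. conv=fsync makes dd flush to the device before
# exiting, so the timing is not just page-cache speed.
write_mb_per_s() {
    target="$1"
    mb="${2:-100}"                # size to write, in MB
    start=$(date +%s)
    dd if=/dev/zero of="$target" bs=1M count="$mb" conv=fsync 2>/dev/null
    end=$(date +%s)
    secs=$((end - start))
    [ "$secs" -lt 1 ] && secs=1   # avoid divide-by-zero on fast writes
    echo $((mb / secs))
}
```

For example: `write_mb_per_s /mnt/pvfs2/speedtest 1000` versus `write_mb_per_s /tmp/speedtest 1000` on the same node. Whole-second timing is coarse, so use file sizes large enough to take at least several seconds.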
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
