Another question: we've expanded our pvfs2 disk storage (uniformly) twice now. Do I need to run some form of "defrag" or other optimizer?
--Jim

On Mon, Oct 3, 2011 at 11:38 AM, Jim Kusznir <[email protected]> wrote:
> All speeds were in Mbps, the default from iperf.
>
> Our files are multi-GB in size, so they do involve all three servers.
> It applies to all files on the system.
>
> Can I change the stripe size "on the fly"? I already have about 50TB
> of data in the system, and have no place big enough to back it up to
> rebuild the pvfs2 array and restore....
>
> --Jim
>
> On Fri, Sep 30, 2011 at 1:46 PM, Michael Moore <[email protected]> wrote:
>> See below for specific items. Can you run iostat on the servers while
>> writing a file that experiences the slow performance? If you could watch
>> iostat -dmx <device of pvfs storage space> and provide any salient
>> snippets (high utilization, low utilization, odd-looking output, etc.),
>> that could help.
>>
>> On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]> wrote:
>>>
>>> 1) iperf (defaults) reported 873, 884, and 929 for connections from
>>> the three servers to the head node (a pvfs2 client).
>>
>> Just to be clear, those are Mbps, right?
>>
>>> 2) No errors showed up on any of the ports on the managed switch.
>>
>> Hmm, if those are Mbps, this doesn't look like a network-layer problem.
>>
>>> 3) I'm not sure what this will do, as the pvfs2 volume is comprised of
>>> 3 servers, so mounting it on a server still uses the network for the
>>> other two. I also don't understand the "single datafile per file"
>>> statement. In any case, I do not have the kernel module compiled on
>>> my servers; they ONLY have the pvfs2 server software installed.
>>
>> A logical file (e.g. foo.out) in a PVFS2 file system is made up of one
>> or more datafiles. Based on your config I would assume most are made up
>> of 3 datafiles with the default stripe size of 64k.
>>
>> You can run pvfs2-viewdist -f <file name> to see what the distribution
>> is and which servers a given file lives on. To see cumulative throughput
>> from multiple PVFS2 servers, the number of datafiles must be greater
>> than one. Check a couple of the problematic files to see what their
>> distribution is.
>>
>> For a quick test to see if the distribution is impacting performance,
>> set the following extended attribute on a directory and then check the
>> performance of writing a file into it:
>> setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>
>>
>> Also, you can test whether a larger strip_size would help by doing
>> something similar to (for a 256k strip):
>> setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
>> setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>
>>
>>> 4) I'm not sure; I used largely defaults. I've attached my config below.
>>>
>>> 5) The network bandwidth figure is from one of the servers (the one I
>>> checked; I believe them all to be similar).
>>>
>>> 6) Not sure. I created an XFS filesystem using LVM to combine the two
>>> hardware raid6 volumes and mounted that at /mnt/pvfs2 on the servers.
>>> I then let pvfs do its magic. Config files below.
>>>
>>> 7) (from second e-mail): Config file attached.
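A minimal sketch of the distribution check and stripe experiment suggested
above, assuming the pvfs2 userspace tools are available on a client; the
scratch directory /mnt/pvfs2/scratch and the file run42.out are placeholder
names, not Jim's actual paths:

    # show how many datafiles the file is striped across and which servers hold them
    pvfs2-viewdist -f /mnt/pvfs2/scratch/run42.out

    # hint that new files in this directory should use 3 datafiles and a 256k strip
    setfattr -n user.pvfs2.num_dfiles -v "3" /mnt/pvfs2/scratch
    setfattr -n user.pvfs2.dist_name -v simple_stripe /mnt/pvfs2/scratch
    setfattr -n user.pvfs2.dist_params -v strip_size:262144 /mnt/pvfs2/scratch

    # then time a large sequential write into that directory and compare
    dd if=/dev/zero of=/mnt/pvfs2/scratch/stripe_test bs=1M count=1024 conv=fsync

These attributes are hints for files created after they are set; existing
files keep their original distribution (copying a file into a directory that
carries the new attributes produces a copy with the new layout).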
>>>
>>> ----------
>>> /etc/pvfs2-fs.conf:
>>> ----------
>>> [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
>>> <Defaults>
>>> UnexpectedRequests 50
>>> EventLogging none
>>> LogStamp datetime
>>> BMIModules bmi_tcp
>>> FlowModules flowproto_multiqueue
>>> PerfUpdateInterval 1000
>>> ServerJobBMITimeoutSecs 30
>>> ServerJobFlowTimeoutSecs 30
>>> ClientJobBMITimeoutSecs 300
>>> ClientJobFlowTimeoutSecs 300
>>> ClientRetryLimit 5
>>> ClientRetryDelayMilliSecs 2000
>>> StorageSpace /mnt/pvfs2
>>> LogFile /var/log/pvfs2-server.log
>>> </Defaults>
>>>
>>> <Aliases>
>>> Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
>>> Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
>>> Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
>>> </Aliases>
>>>
>>> <Filesystem>
>>> Name pvfs2-fs
>>> ID 62659950
>>> RootHandle 1048576
>>> <MetaHandleRanges>
>>> Range pvfs2-io-0-0 4-715827885
>>> Range pvfs2-io-0-1 715827886-1431655767
>>> Range pvfs2-io-0-2 1431655768-2147483649
>>> </MetaHandleRanges>
>>> <DataHandleRanges>
>>> Range pvfs2-io-0-0 2147483650-2863311531
>>> Range pvfs2-io-0-1 2863311532-3579139413
>>> Range pvfs2-io-0-2 3579139414-4294967295
>>> </DataHandleRanges>
>>> <StorageHints>
>>> TroveSyncMeta yes
>>> TroveSyncData no
>>> </StorageHints>
>>> </Filesystem>
>>>
>>> ---------------------
>>> /etc/pvfs2-server.conf-pvfs2-io-0-2
>>> ---------------------
>>> StorageSpace /mnt/pvfs2
>>> HostID "tcp://pvfs2-io-0-2:3334"
>>> LogFile /var/log/pvfs2-server.log
>>> ---------------------
>>>
>>> All the server config files are very similar.
>>>
>>> --Jim
>>>
>>> On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore <[email protected]> wrote:
>>> > No doubt something is awry. Offhand I'm suspecting the network. A
>>> > couple of things that might help give a direction:
>>> > 1) Do an end-to-end TCP test between client/server. Something like
>>> > iperf or nuttcp should do the trick.
>>> > 2) Check the server and client ethernet ports on the switch for high
>>> > error counts (not familiar with that switch, not sure if it's managed
>>> > or not). Hardware (port/cable) errors should show up in the above test.
>>> > 3) Can you mount the PVFS2 file system on the server and run some I/O
>>> > tests (single datafile per file) to see if the network is in fact in
>>> > play?
>>> > 4) How many datafiles (by default) is each file you're writing to
>>> > using? 3?
>>> > 5) When you watch network bandwidth and see 10 MB/s, where is that?
>>> > On the server?
>>> > 6) What backend are you using for I/O, direct or alt-aio? Nothing
>>> > really wrong either way, just wondering.
>>> >
>>> > It sounds like, based on the dd output, the disks are capable of more
>>> > than you're seeing; we just need to narrow down where the performance
>>> > is getting squelched.
>>> >
>>> > Michael
>>> >
>>> > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
>>> >>
>>> >> Hi all:
>>> >>
>>> >> I've got a pvfs2 install on my cluster. I never felt it was
>>> >> performing up to snuff, but lately things have gone way, way down in
>>> >> total throughput and overall usability, to the point that jobs
>>> >> writing out 900MB take an extra 1-2 hours to complete due to disk
>>> >> I/O waits. A job that writes about 30GB over the course of a run
>>> >> (normally about 2 hrs long) takes up to 20 hrs; once the disk I/O is
>>> >> cut out, it completes in 1.5-2 hrs. I've also noticed that there's
>>> >> up to a 5 sec lag when I cd into /mnt/pvfs2 and do an ls.
>>> >> Note that all of our operations go through the kernel module /
>>> >> mount point. Our problems and code base do not support the use of
>>> >> other tools (such as the pvfs2-* utilities or the native MPI
>>> >> libraries); it's all done through the kernel module / filesystem
>>> >> mountpoint.
>>> >>
>>> >> My configuration is this: 3 pvfs2 servers (Dell PowerEdge 1950s
>>> >> with 1.6GHz quad-core CPUs, 4GB RAM, RAID-0 for metadata+OS on a
>>> >> Perc5i card), each with a Dell Perc6e card running hardware raid6
>>> >> in two volumes: one on a bunch of 750GB SATA drives, and the other,
>>> >> on its second SAS connector, on about 12 2TB WD drives. The two
>>> >> raid volumes are LVM'ed together in the OS and mounted as the pvfs2
>>> >> data store. Each server is connected via ethernet to a stack of
>>> >> LG-Ericsson gig-e switches (stack == 2 switches with 40Gbit
>>> >> stacking cables installed). PVFS 2.8.2 is used throughout the
>>> >> cluster on Rocks (using site-compiled pvfs, not the Rocks-supplied
>>> >> pvfs). OSes are CentOS 5.x-based (both clients and servers).
>>> >>
>>> >> As I said, I always felt something wasn't quite right, but a few
>>> >> months back I performed a series of upgrades and reconfigurations
>>> >> on the infrastructure and hardware. Specifically, I upgraded to the
>>> >> LG-Ericsson switches and replaced a full 12-bay drive shelf with a
>>> >> 24-bay one (moving all the disks through) and added some additional
>>> >> disks. All three pvfs2 servers are identical in this. At some point
>>> >> prior to these changes, my users were able to get acceptable
>>> >> performance from pvfs2; now they are not. I don't have any evidence
>>> >> pointing to the switch or to the disks.
>>> >>
>>> >> I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get
>>> >> 380+MB/s locally on the pvfs server, writing to the partition on
>>> >> the hardware raid6 card. From a compute node, doing that for a
>>> >> 100MB file, I get 47.7MB/s to my RAID-5 NFS server on the head
>>> >> node, and 36.5MB/s to my pvfs2-mounted share. When I watch the
>>> >> network bandwidth/throughput using bwm-ng, I rarely see more than
>>> >> 10MB/s, and often it's around 4MB/s with a 12-node I/O-bound job
>>> >> running.
>>> >>
>>> >> I originally had the pvfs2 servers connected to the switch with
>>> >> dual gig-e connections and using bonding (ALB) to make them better
>>> >> able to serve multiple nodes. I never saw anywhere close to the
>>> >> throughput I should have. In any case, to test whether that was the
>>> >> problem, I removed the bonding and am running through a single
>>> >> gig-e pipe now, but performance hasn't improved at all.
>>> >>
>>> >> I'm not sure how to troubleshoot this problem further. Presently,
>>> >> the cluster isn't usable for large I/O jobs, so I really have to
>>> >> fix this.
>>> >>
>>> >> --Jim
>>> >> _______________________________________________
>>> >> Pvfs2-users mailing list
>>> >> [email protected]
>>> >> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
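For anyone working through the same checklist, here is a rough sketch of
the network and disk checks discussed in this thread. The server name
pvfs2-io-0-0 is taken from the config above; the device /dev/sdb backing
/mnt/pvfs2 and the test file paths are placeholders:

    # 1) end-to-end TCP throughput between a client and each server (iperf v2 syntax)
    iperf -s                        # run on the server
    iperf -c pvfs2-io-0-0 -t 30     # run on the client; ~900+ Mbits/sec is healthy gig-e

    # 2) raw sequential write speed of the backing RAID volume, run locally on the server
    #    (a throwaway file at the top of the backing XFS volume; remove it afterwards)
    dd if=/dev/zero of=/mnt/pvfs2/ddtest bs=1024k count=10000 conv=fsync

    # 3) while a slow client write is in flight, watch the backing device on the server
    iostat -dmx /dev/sdb 5

    # 4) and watch the wire on the server at the same time
    bwm-ng

If iperf shows near line rate and the local dd is fast, but iostat shows
the device near 100% utilization while moving only a few MB/s during PVFS2
writes, the servers are probably doing many small, scattered I/Os rather
than large sequential ones, which would fit the symptoms described above.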
