I got some more information today. A user had run an I/O-intensive job after the upgrades and did not experience problems. However, that same job is now having trouble. They think it might be due to increased job load, but most of the jobs currently running appear to be doing little I/O except theirs, so I'm skeptical.
As to watching the load on the servers, I have tried. I do watch top and occasionally will go down to the servers themselves and watch the disk I/O lights. I've watched network I/O (using bwm-ng). All of these indicate it's loafing around at 10%-15% of capacity. I don't know how to watch actual IOPS or other more direct metrics. By every measurement I can do, it works fine as long as pvfs is not involved. Once one starts routing traffic through pvfs, performance drops to pathetic levels... I don't know how to go any further with this. It used to work better than it does now, too. The problem occurs both on my head node and on my compute nodes. (BTW: my head node gets reset/rebooted semi-frequently due to kernel OOPSes caused by the pvfs2 module, typically during times of high demand through the head node.)

--Jim

On Tue, Oct 4, 2011 at 9:04 AM, Michael Moore <[email protected]> wrote:
> On Tue, Oct 4, 2011 at 11:43 AM, Jim Kusznir <[email protected]> wrote:
>>
>> I didn't try the stripe size; I misinterpreted your suggestion. Setting that doesn't re-stripe existing data, does it? I think a lot of the I/O is reading existing data, then writing out some files. I don't think I have a good means of getting hard numbers / a strong evaluation for production loads, as I don't have a viable means of separating out old from new. All I have is how long the run took, and I don't even know where all they're doing their I/O from. There are lots of different directories (one directory took several days to chmod its files recursively).
>>
>> Is this the type of thing I can set on the root, and it will recursively fall down?
>
> When setting this class of attributes on a directory it applies to files in that directory. It is not recursive and does not affect existing files. For example, the output shown by pvfs2-viewdist won't change on existing files if you set the strip_size on a directory. However, new files should use the new strip_size.
>
>> Also, how do I determine the correct stripe size to use?
>
> You'll likely need to play around with it. It can be influenced by everything from the storage hardware to the application on a client writing data. I typically focus on the storage side for determining the stripe size, but that's for a 'general purpose' file system. It sounds like you may have specific jobs you want to tune to.
>
>> I'm still concerned about the fact that the performance dropped so notably after replacing the switch and adding more storage to the array. All of what you're pointing to sounds like it should have always been that bad...
>
> If the iperf tests you ran are representative of the traffic that occurs when you see low performance, then the network is likely not the issue. However, if they're not representative (if I/O happens from compute nodes, not your head node, for example) you may want to try from several compute nodes to see if congestion dramatically reduces throughput, or whether there are one or two trouble nodes if collective I/O is being done.
>
> Watching disk and network performance on the servers while low-performing I/O is occurring should indicate if the disk or network on one or more of the servers is an issue. If not, then it's time to look at the clients more closely.
>
> Others may have suggestions too; not meaning to prevent others from throwing out ideas.
>
> Michael
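Since these attributes are not recursive and only affect files created after they are set, one way to cover an existing tree is to stamp every directory in it once. This is only a rough sketch, assuming the client mount point is /mnt/pvfs2 and using the dist_name/dist_params attributes shown further down in the quoted thread (the 256k strip size is just an example value):

  # stamp every existing directory so that new files created in them use the new layout
  find /mnt/pvfs2 -type d -exec setfattr -n user.pvfs2.dist_name -v simple_stripe {} \;
  find /mnt/pvfs2 -type d -exec setfattr -n user.pvfs2.dist_params -v strip_size:262144 {} \;

Existing files keep their old distribution either way, and on a tree where a recursive chmod took days this walk will be slow as well, so it may be more practical to start with just the directories the I/O-heavy jobs write into.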
>>
>> --Jim
>>
>> On Mon, Oct 3, 2011 at 11:54 AM, Michael Moore <[email protected]> wrote:
>> > On Mon, Oct 3, 2011 at 2:38 PM, Jim Kusznir <[email protected]> wrote:
>> >>
>> >> All speeds were in Mbps, the default from iperf.
>> >>
>> >> Our files are multi-GB in size, so they do involve all three servers. It applies to all files on the system.
>> >
>> > Okay, good, wanted to confirm.
>> >
>> >> Can I change the stripe size "on the go"? I already have about 50TB of data in the system, and have no place big enough to back it up to so I could rebuild the pvfs2 array and restore....
>> >
>> > Unfortunately, not that I know of. You can set the extended attributes, mentioned previously, on all directories so new files will use a different stripe size. Ideally, the strip per server will be equal to some unit that your underlying file system and storage digest efficiently, like a RAID stripe. Did a larger stripe improve your observed throughput?
>> >
>> > Michael
>> >
>> >> --Jim
>> >>
>> >> On Fri, Sep 30, 2011 at 1:46 PM, Michael Moore <[email protected]> wrote:
>> >> > See below for specific items. Can you run iostat on the servers while writing a file that experiences the slow performance? If you could watch iostat -dmx <device of pvfs storage space> and provide any salient snippets (high utilization, low utilization, odd-looking output, etc.) that could help.
>> >> >
>> >> > On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]> wrote:
>> >> >>
>> >> >> 1) iperf (defaults) reported 873, 884, and 929 for connections from the three servers to the head node (a pvfs2 client)
>> >> >
>> >> > Just to be clear, those are Mbps, right?
>> >> >
>> >> >> 2) no errors showed up on any of the ports on the managed switch.
>> >> >
>> >> > Hmm, if those are Mbps it's not looking like a network-layer problem.
>> >> >
>> >> >> 3) I'm not sure what this will do, as the pvfs2 volume is comprised of 3 servers, so mounting it on a server still uses the network for the other two. I also don't understand the "single datafile per file" statement. In any case, I do not have the kernel module compiled on my servers; they ONLY have the pvfs2 server software installed.
>> >> >
>> >> > A logical file (e.g. foo.out) in a PVFS2 file system is made up of one or more datafiles. Based on your config I would assume most are made up of 3 datafiles with the default stripe size of 64k.
>> >> >
>> >> > You can run pvfs2-viewdist -f <file name> to see what the distribution is and what servers a given file lives on. To see cumulative throughput from multiple PVFS2 servers the number of datafiles must be greater than one. Check a couple of the problematic files to see what their distribution is.
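For that check, a minimal sketch run from a client with the file system mounted (the file names here are placeholders; pvfs2-viewdist is one of the PVFS2 userspace tools):

  # show the distribution and the servers holding a couple of the slow job's files
  pvfs2-viewdist -f /mnt/pvfs2/some_job_dir/output_file_1
  pvfs2-viewdist -f /mnt/pvfs2/some_job_dir/output_file_2

If a problematic file turns out to have only one datafile, it lives entirely on one server and cannot go faster than what that single server and its gig-e link can deliver.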
>> >> >
>> >> > For a quick test to see if the distribution is impacting performance, set the following extended attribute on a directory and then check the performance of writing a file into it:
>> >> > setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>
>> >> >
>> >> > Also, you can test whether a larger strip_size would help by doing something similar to (for a 256k strip):
>> >> > setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
>> >> > setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>
>> >> >
>> >> >> 4) I'm not sure; I used largely defaults. I've attached my config below.
>> >> >>
>> >> >> 5) The network bandwidth measurement was on one of the servers (the one I checked; I believe them all to be similar).
>> >> >>
>> >> >> 6) Not sure. I created an XFS filesystem using LVM to combine the two hardware raid6 volumes and mounted that at /mnt/pvfs2 on the servers. I then let pvfs do its magic. Config files below.
>> >> >>
>> >> >> 7) (from second e-mail) Config file attached.
>> >> >>
>> >> >> ----------
>> >> >> /etc/pvfs2-fs.conf:
>> >> >> ----------
>> >> >> [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
>> >> >> <Defaults>
>> >> >> UnexpectedRequests 50
>> >> >> EventLogging none
>> >> >> LogStamp datetime
>> >> >> BMIModules bmi_tcp
>> >> >> FlowModules flowproto_multiqueue
>> >> >> PerfUpdateInterval 1000
>> >> >> ServerJobBMITimeoutSecs 30
>> >> >> ServerJobFlowTimeoutSecs 30
>> >> >> ClientJobBMITimeoutSecs 300
>> >> >> ClientJobFlowTimeoutSecs 300
>> >> >> ClientRetryLimit 5
>> >> >> ClientRetryDelayMilliSecs 2000
>> >> >> StorageSpace /mnt/pvfs2
>> >> >> LogFile /var/log/pvfs2-server.log
>> >> >> </Defaults>
>> >> >>
>> >> >> <Aliases>
>> >> >> Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
>> >> >> Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
>> >> >> Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
>> >> >> </Aliases>
>> >> >>
>> >> >> <Filesystem>
>> >> >> Name pvfs2-fs
>> >> >> ID 62659950
>> >> >> RootHandle 1048576
>> >> >> <MetaHandleRanges>
>> >> >> Range pvfs2-io-0-0 4-715827885
>> >> >> Range pvfs2-io-0-1 715827886-1431655767
>> >> >> Range pvfs2-io-0-2 1431655768-2147483649
>> >> >> </MetaHandleRanges>
>> >> >> <DataHandleRanges>
>> >> >> Range pvfs2-io-0-0 2147483650-2863311531
>> >> >> Range pvfs2-io-0-1 2863311532-3579139413
>> >> >> Range pvfs2-io-0-2 3579139414-4294967295
>> >> >> </DataHandleRanges>
>> >> >> <StorageHints>
>> >> >> TroveSyncMeta yes
>> >> >> TroveSyncData no
>> >> >> </StorageHints>
>> >> >> </Filesystem>
>> >> >>
>> >> >> ---------------------
>> >> >> /etc/pvfs2-server.conf-pvfs2-io-0-2
>> >> >> ---------------------
>> >> >> StorageSpace /mnt/pvfs2
>> >> >> HostID "tcp://pvfs2-io-0-2:3334"
>> >> >> LogFile /var/log/pvfs2-server.log
>> >> >> ---------------------
>> >> >>
>> >> >> All the server config files are very similar.
>> >> >>
>> >> >> --Jim
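One way to turn Michael's quick test above into hard numbers without touching production data is to tune a single scratch directory, write a test file into it, and look at the layout that results. This is only a sketch, with made-up directory and file names, run from a client with the file system mounted at /mnt/pvfs2:

  mkdir /mnt/pvfs2/stripe-test
  setfattr -n user.pvfs2.num_dfiles -v "3" /mnt/pvfs2/stripe-test
  setfattr -n user.pvfs2.dist_name -v simple_stripe /mnt/pvfs2/stripe-test
  setfattr -n user.pvfs2.dist_params -v strip_size:262144 /mnt/pvfs2/stripe-test
  # write a test file and confirm it picked up three datafiles and the 256k strip
  dd if=/dev/zero of=/mnt/pvfs2/stripe-test/probe bs=1024k count=1000
  pvfs2-viewdist -f /mnt/pvfs2/stripe-test/probe

Running the same dd into an untuned directory gives a direct before/after comparison for the stripe settings, independent of what the production jobs happen to be doing.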
>> >> >>
>> >> >> On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore <[email protected]> wrote:
>> >> >> > No doubt something is awry. Offhand I'm suspecting the network. A couple of things that might help give a direction:
>> >> >> > 1) Do an end-to-end TCP test between client/server. Something like iperf or nuttcp should do the trick.
>> >> >> > 2) Check server and client ethernet ports on the switch for high error counts (not familiar with that switch, not sure if it's managed or not). Hardware (port/cable) errors should show up in the above test.
>> >> >> > 3) Can you mount the PVFS2 file system on a server and run some I/O tests (single datafile per file) to see if the network is in fact in play?
>> >> >> > 4) What is the number of datafiles (by default) each file you're writing to is using? 3?
>> >> >> > 5) When you watch network bandwidth and see 10 MB/s, where is that? On the server?
>> >> >> > 6) What backend are you using for I/O, direct or alt-aio? Nothing really wrong either way, just wondering.
>> >> >> >
>> >> >> > It sounds like, based on the dd output, the disks are capable of more than you're seeing; we just need to narrow down where the performance is getting squelched.
>> >> >> >
>> >> >> > Michael
>> >> >> >
>> >> >> > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
>> >> >> >>
>> >> >> >> Hi all:
>> >> >> >>
>> >> >> >> I've got a pvfs2 install on my cluster. I never felt it was performing up to snuff, but lately it seems that things have gone way, way down in total throughput and overall usability, to the point that jobs writing out 900MB take an extra 1-2 hours to complete due to disk I/O waits. A job that writes about 30GB over the course of a run that is normally about 2 hrs long now takes up to 20 hrs. Once the disk I/O is cut out, it completes in 1.5-2 hrs. I've also noticed personally that there's up to a 5-second lag when I cd into /mnt/pvfs2 and do an ls. Note that all of our operations use the kernel module / mount point. Our problems and code base do not support the use of other tools (such as the pvfs2-* utilities or the native MPI libraries); it's all done through the kernel module / filesystem mountpoint.
>> >> >> >>
>> >> >> >> My configuration is this: 3 pvfs2 servers (Dell PowerEdge 1950s with 1.6GHz quad-core CPUs, 4GB RAM, raid-0 for metadata+OS on a Perc5i card), each with a Dell Perc6e card with hardware raid6 in two volumes: one on a bunch of 750GB SATA drives, and the other, on its second SAS connector, on about 12 2TB WD drives. The two raid volumes are LVM'ed together in the OS and mounted as the pvfs2 data store. Each server is connected via ethernet to a stack of LG-errison gig-e switches (stack == 2 switches with 40Gbit stacking cables installed). PVFS 2.8.2 is used throughout the cluster on Rocks (using site-compiled pvfs, not the rocks-supplied pvfs). OSes are CentOS5-x-based (both clients and servers).
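Given that layout, the iostat check Michael asks for earlier in the thread would be run on each of the three servers against the devices underneath the LVM volume while a slow job is writing. A sketch, where sdb and sdc stand in for the two Perc6e raid6 volumes (substitute whatever device names the volumes actually have):

  # per-device throughput, IOPS (r/s + w/s), queue depth, and %util, refreshed every 5 seconds
  iostat -dmx sdb sdc 5

If %util sits near 100 on one server while the other two are mostly idle, the load isn't being spread across the servers; if all three are nearly idle while clients see only a few MB/s, the bottleneck is upstream of the disks.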
>> >> >> >>
>> >> >> >> As I said, I always felt something wasn't quite right, but a few months back I performed a series of upgrades and reconfigurations on the infrastructure and hardware. Specifically, I upgraded to the LG-errison switches, replaced a full 12-bay drive shelf with a 24-bay one (moving all the disks through), and added some additional disks. All three pvfs2 servers are identical in this. At some point prior to these changes, my users were able to get acceptable performance from pvfs2; now they are not. I don't have any evidence pointing to the switch or to the disks.
>> >> >> >>
>> >> >> >> I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get 380+MB/s locally on a pvfs server, writing to the partition on the hardware raid6 card. From a compute node, doing that for a 100MB file, I get 47.7MB/s to my RAID-5 NFS server on the head node, and 36.5MB/s to my pvfs2-mounted share. When I watch the network bandwidth/throughput using bwm-ng, I rarely see more than 10MB/s, and often it's around 4MB/s with a 12-node I/O-bound job running.
>> >> >> >>
>> >> >> >> I originally had the pvfs2 servers connected to the switch with dual gig-e connections and using bonding (ALB) to make them better able to serve multiple nodes. I never saw anywhere close to the throughput I should. In any case, to test whether that was the problem, I removed the bonding and am running through a single gig-e pipe now, but performance hasn't improved at all.
>> >> >> >>
>> >> >> >> I'm not sure how to troubleshoot this problem further. Presently, the cluster isn't usable for large I/O jobs, so I really have to fix this.
>> >> >> >>
>> >> >> >> --Jim
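Michael's suggestion near the top of the thread to repeat the network test from several compute nodes could look something like the following minimal sketch (assumes iperf is installed on both ends and nothing else is using its default port; the server names come from the Aliases section of the config above):

  # on each pvfs2 server, start a listener:
  iperf -s

  # from two or three different compute nodes, test against each server:
  iperf -c pvfs2-io-0-0
  iperf -c pvfs2-io-0-1
  iperf -c pvfs2-io-0-2

If single-client numbers stay near the roughly 900 Mbps already measured from the head node but fall sharply when several compute nodes test at once, that would point at congestion in the switch stack; one node that is consistently slow against every server would point at that node's cable or port.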
