On Wed, Oct 5, 2011 at 11:44 AM, Jim Kusznir <[email protected]> wrote:
> I got some more information today. A user had run an I/O-intensive job after the upgrades and did not experience problems. However, that job is now having problems. They think it might be due to increased job load, but it appears most of the jobs running are doing little I/O except theirs, so I'm skeptical.
>
> As to watching the load on the servers, I have tried. I do watch top and occasionally will go down to the servers themselves and watch the disk I/O lights. I've watched network I/O (using bwm-ng). All of these indicate it's loafing around at 10%-15% of capacity. I don't know how to watch actual IOPS or other more direct metrics.

Specifically, I was wanting to see iostat output. It will show disk throughput, disk request size, IOPS, etc.
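
For example, something along these lines on each server while one of the slow jobs is writing (the device name here is only a placeholder; use whatever device actually backs /mnt/pvfs2 on your servers, and note the exact column names vary a bit between sysstat versions):

    # extended per-device stats, in MB, sampled every 5 seconds
    iostat -dmx /dev/sdb 5

r/s and w/s are read/write IOPS, rMB/s and wMB/s are throughput, avgrq-sz is the average request size, and %util shows how busy the device is during the run.
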
> By every measurement I can do, it works fine as long as pvfs is not involved. Once one starts routing traffic through pvfs, performance drops to pathetic levels... I don't know how to go any further with this. It used to work better than it does, too.
>
> The problem occurs both on my head node and on my compute nodes. (BTW: my head node gets reset/rebooted semi-frequently due to kernel OOPSes caused by the pvfs2 module, typically during times of high demand through the head node.)

There have been several fixes for various kernel panics since 2.8.2; one related to pvfs2 kernel memory being paged out may relate to your issue.

Michael

> --Jim
>
> On Tue, Oct 4, 2011 at 9:04 AM, Michael Moore <[email protected]> wrote:
> > On Tue, Oct 4, 2011 at 11:43 AM, Jim Kusznir <[email protected]> wrote:
> > >
> > > I didn't try the stripe size; I misinterpreted your suggestion. Setting that doesn't re-stripe existing data, does it? I think a lot of the I/O is reading existing data, then writing out some files. I don't think I have a good means of getting hard numbers / a strong evaluation for production loads, as I don't have a viable means of separating out old from new, just how long the run took, and I don't even know where all they're doing their I/O from. There are lots of different directories (one directory took several days to recursively chmod the files there).
> > >
> > > Is this the type of thing I can set on the root, and it will recursively fall down?
> >
> > When setting this class of attributes on a directory it applies to files in that directory. It is not recursive and does not affect existing files. For example, the output shown by pvfs2-viewdist won't change on existing files if you set the strip_size on a directory. However, new files should use the new strip_size.
> >
> > > Also, how do I determine the correct stripe size to use?
> >
> > You'll likely need to play around with it. It can be influenced by anything from the storage hardware to the application on a client writing data. I typically focus on the storage side for determining the stripe size, but that's for a 'general purpose' file system. It sounds like you may have specific jobs you want to tune for.
> >
> > > I'm still concerned about the fact that the performance dropped so notably after replacing the switch and adding more storage to the array. All of what you're pointing to sounds like it should have always been that bad...
> >
> > If the iperf tests you ran are representative of the traffic that occurs when you see low performance, then it indicates the network is likely not the issue. However, if it's not (if I/O happens from the compute nodes, not your head node, for example) you may want to try from several compute nodes to see if congestion dramatically reduces throughput, or whether there are one or two trouble nodes if collective I/O is being done.
> >
> > Watching disk and network performance on the servers while low-performing I/O is occurring should indicate whether the disk or network on one or more of the servers is an issue. If not, then it's time to look more closely at the clients.
> >
> > Others may have suggestions too; not meaning to prevent others from throwing out ideas.
> >
> > Michael
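
Regarding the several-compute-nodes test above, a rough sketch (the server names are from your config; pick whichever compute nodes the I/O-heavy jobs actually land on):

    # on each PVFS2 server, leave an iperf server running
    iperf -s

    # from several compute nodes at roughly the same time
    iperf -c pvfs2-io-0-0 -t 30
    iperf -c pvfs2-io-0-1 -t 30
    iperf -c pvfs2-io-0-2 -t 30

If a handful of clients hitting one server at once drops the aggregate well below the ~900 Mbps you saw single-stream, or one node is consistently slow, that points at the network/switch rather than at PVFS2.
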
> > > --Jim
> > >
> > > On Mon, Oct 3, 2011 at 11:54 AM, Michael Moore <[email protected]> wrote:
> > > > On Mon, Oct 3, 2011 at 2:38 PM, Jim Kusznir <[email protected]> wrote:
> > > > >
> > > > > All speeds were in Mbps, the default from iperf.
> > > > >
> > > > > Our files are multi-GB in size, so they do involve all three servers. It applies to all files on the system.
> > > >
> > > > Okay, good, wanted to confirm.
> > > >
> > > > > Can I change the stripe size "on the go"? I already have about 50TB of data in the system, and have no place big enough to back it up to in order to rebuild the pvfs2 array and restore....
> > > >
> > > > Unfortunately, not that I know of. You can set the extended attributes, mentioned previously, on all directories so new files will use a different stripe size. Ideally, the strip per server will be equal to some unit that your underlying file system and storage digest efficiently, like a RAID stripe. Did a larger stripe improve your observed throughput?
> > > >
> > > > Michael
> > > >
> > > > > --Jim
> > > > >
> > > > > On Fri, Sep 30, 2011 at 1:46 PM, Michael Moore <[email protected]> wrote:
> > > > > > See below for specific items. Can you run iostat on the servers while writing a file that experiences the slow performance? If you could watch iostat -dmx <device of pvfs storage space> and provide any salient snippets (high utilization, low utilization, odd-looking output, etc.), that could help.
> > > > > >
> > > > > > On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]> wrote:
> > > > > > >
> > > > > > > 1) iperf (defaults) reported 873, 884, and 929 for connections from the three servers to the head node (a pvfs2 client).
> > > > > >
> > > > > > Just to be clear, those are Mbps, right?
> > > > > >
> > > > > > > 2) No errors showed up on any of the ports on the managed switch.
> > > > > >
> > > > > > Hmm, if those are Mbps, it doesn't seem to be a network-layer problem.
> > > > > >
> > > > > > > 3) I'm not sure what this will do, as the pvfs2 volume is comprised of 3 servers, so mounting it on a server still uses the network for the other two. I also don't understand the "single file per datafile" statement. In any case, I do not have the kernel module compiled on my servers; they ONLY have the pvfs2 server software installed.
> > > > > >
> > > > > > A logical file (e.g. foo.out) in a PVFS2 file system is made up of one or more datafiles. Based on your config I would assume most are made up of 3 datafiles with the default stripe size of 64k.
> > > > > >
> > > > > > You can run pvfs2-viewdist -f <file name> to see what the distribution is and what servers a given file lives on. To see cumulative throughput from multiple PVFS2 servers the number of datafiles must be greater than one. Check a couple of the problematic files to see what their distribution is.
> > > > > >
> > > > > > For a quick test to see if the distribution is impacting performance, set the following extended attribute on a directory and then check the performance of writing a file into it:
> > > > > > setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>
> > > > > >
> > > > > > Also, you can test whether a larger strip_size would help by doing something similar to (for a 256k strip):
> > > > > > setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
> > > > > > setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>
> > > > > >
> > > > > > > 4) I'm not sure; I used largely defaults. I've attached my config below.
> > > > > > >
> > > > > > > 5) The network bandwidth is on one of the servers (the one I checked; I believe them all to be similar).
> > > > > > >
> > > > > > > 6) Not sure. I have created an XFS filesystem using LVM to combine the two hardware raid6 volumes and mounted that at /mnt/pvfs2 on the servers. I then let pvfs do its magic. Config files below.
> > > > > > >
> > > > > > > 7) (from second e-mail) Config file attached.
> > > > > > >
> > > > > > > ----------
> > > > > > > /etc/pvfs2-fs.conf:
> > > > > > > ----------
> > > > > > > [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
> > > > > > > <Defaults>
> > > > > > > UnexpectedRequests 50
> > > > > > > EventLogging none
> > > > > > > LogStamp datetime
> > > > > > > BMIModules bmi_tcp
> > > > > > > FlowModules flowproto_multiqueue
> > > > > > > PerfUpdateInterval 1000
> > > > > > > ServerJobBMITimeoutSecs 30
> > > > > > > ServerJobFlowTimeoutSecs 30
> > > > > > > ClientJobBMITimeoutSecs 300
> > > > > > > ClientJobFlowTimeoutSecs 300
> > > > > > > ClientRetryLimit 5
> > > > > > > ClientRetryDelayMilliSecs 2000
> > > > > > > StorageSpace /mnt/pvfs2
> > > > > > > LogFile /var/log/pvfs2-server.log
> > > > > > > </Defaults>
> > > > > > >
> > > > > > > <Aliases>
> > > > > > > Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
> > > > > > > Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
> > > > > > > Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
> > > > > > > </Aliases>
> > > > > > >
> > > > > > > <Filesystem>
> > > > > > > Name pvfs2-fs
> > > > > > > ID 62659950
> > > > > > > RootHandle 1048576
> > > > > > > <MetaHandleRanges>
> > > > > > > Range pvfs2-io-0-0 4-715827885
> > > > > > > Range pvfs2-io-0-1 715827886-1431655767
> > > > > > > Range pvfs2-io-0-2 1431655768-2147483649
> > > > > > > </MetaHandleRanges>
> > > > > > > <DataHandleRanges>
> > > > > > > Range pvfs2-io-0-0 2147483650-2863311531
> > > > > > > Range pvfs2-io-0-1 2863311532-3579139413
> > > > > > > Range pvfs2-io-0-2 3579139414-4294967295
> > > > > > > </DataHandleRanges>
> > > > > > > <StorageHints>
> > > > > > > TroveSyncMeta yes
> > > > > > > TroveSyncData no
> > > > > > > </StorageHints>
> > > > > > > </Filesystem>
> > > > > > >
> > > > > > > ---------------------
> > > > > > > /etc/pvfs2-server.conf-pvfs2-io-0-2
> > > > > > > ---------------------
> > > > > > > StorageSpace /mnt/pvfs2
> > > > > > > HostID "tcp://pvfs2-io-0-2:3334"
> > > > > > > LogFile /var/log/pvfs2-server.log
> > > > > > > ---------------------
> > > > > > >
> > > > > > > All the server config files are very similar.
> > > > > > >
> > > > > > > --Jim
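
Tying the setfattr suggestion above to a concrete test, something like this from a client with the file system mounted (the directory name and sizes are only examples; adjust the path to wherever PVFS2 is mounted on that client):

    # new test directory asking for 3 datafiles and a 256k strip
    mkdir /mnt/pvfs2/stripe-test
    setfattr -n user.pvfs2.num_dfiles -v "3" /mnt/pvfs2/stripe-test
    setfattr -n user.pvfs2.dist_name -v simple_stripe /mnt/pvfs2/stripe-test
    setfattr -n user.pvfs2.dist_params -v strip_size:262144 /mnt/pvfs2/stripe-test

    # write a fresh file into it and note the throughput dd reports
    dd if=/dev/zero of=/mnt/pvfs2/stripe-test/testfile bs=1024k count=1000

    # confirm the new file picked up the intended distribution
    pvfs2-viewdist -f /mnt/pvfs2/stripe-test/testfile

Only files created after the attributes are set will reflect them; existing files keep their old distribution, per the earlier note.
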
> > > > > > > On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore <[email protected]> wrote:
> > > > > > > >
> > > > > > > > No doubt something is awry. Offhand I'm suspecting the network. A couple of things that might help give a direction:
> > > > > > > > 1) Do an end-to-end TCP test between client/server. Something like iperf or nuttcp should do the trick.
> > > > > > > > 2) Check server and client ethernet ports on the switch for high error counts (not familiar with that switch, not sure if it's managed or not). Hardware (port/cable) errors should show up in the above test.
> > > > > > > > 3) Can you mount the PVFS2 file system on the server and run some I/O tests (single datafile per file) to see if the network is in fact in play?
> > > > > > > > 4) How many datafiles (by default) is each file you're writing to using? 3?
> > > > > > > > 5) When you watch network bandwidth and see 10 MB/s, where is that? On the server?
> > > > > > > > 6) What backend are you using for I/O, direct or alt-aio? Nothing really wrong either way, just wondering.
> > > > > > > >
> > > > > > > > It sounds like, based on the dd output, the disks are capable of more than you're seeing; we just need to narrow down where the performance is getting squelched.
> > > > > > > >
> > > > > > > > Michael
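
On the error-counter question, besides the switch counters you can also look from the host side on each server and client. A sketch (the interface name is a placeholder, and the exact counter names vary by NIC driver):

    # driver-level NIC statistics; errors and drops usually stand out
    ethtool -S eth0 | grep -iE 'err|drop'

    # kernel-level RX/TX error and drop counts
    ip -s link show eth0

Nonzero, steadily growing error counts on either end of a link are worth chasing before digging further into the PVFS2 layer.
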
> > > > > > > > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hi all:
> > > > > > > > >
> > > > > > > > > I've got a pvfs2 install on my cluster. I never felt it was performing up to snuff, but lately it seems that things have gone way, way down in total throughput and overall usability. To the tune that jobs writing out 900MB will take an extra 1-2 hours to complete due to disk I/O waits. A 2-hr job that writes about 30GB over the course of the run now takes up to 20hrs; once the disk I/O is cut out, it completes in 1.5-2hrs. I've noticed personally that there's up to a 5 sec lag time when I cd into /mnt/pvfs2 and do an ls. Note that all of our operations use the kernel module / mount point. Our problems and code base do not support the use of other tools (such as the pvfs2-* tools or the native MPI libraries); it's all done through the kernel module / filesystem mountpoint.
> > > > > > > > >
> > > > > > > > > My configuration is this: 3 pvfs2 servers (Dell PowerEdge 1950's with 1.6GHz quad-core CPUs, 4GB RAM, raid-0 for metadata+OS on a perc5i card), plus a Dell Perc6e card with hardware raid6 in two volumes: one on a bunch of 750GB SATA drives, and the other on its second SAS connector to about 12 2TB WD drives. The two raid volumes are lvm'ed together in the OS and mounted as the pvfs2 data store. Each server is connected via ethernet to a stack of LG-Ericsson gig-e switches (stack == 2 switches with 40Gbit stacking cables installed). PVFS 2.8.2 is used throughout the cluster on Rocks (using site-compiled pvfs, not the rocks-supplied pvfs). OSes are CentOS5-x-based (both clients and servers).
> > > > > > > > >
> > > > > > > > > As I said, I always felt something wasn't quite right, but a few months back I performed a series of upgrades and reconfigurations on the infrastructure and hardware. Specifically, I upgraded to the LG-Ericsson switches and replaced a full 12-bay drive shelf with a 24-bay one (moving all the disks through) while adding some additional disks. All three pvfs2 servers are identical in this. At some point prior to these changes, my users were able to get acceptable performance from pvfs2; now they are not. I don't have any evidence pointing to the switch or to the disks.
> > > > > > > > >
> > > > > > > > > I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get 380+MB/s locally on the pvfs server, writing to the partition on the hardware raid6 card. From a compute node, doing that for a 100MB file, I get 47.7MB/s to my RAID-5 NFS server on the head node, and 36.5MB/s to my pvfs2 mounted share. When I watch the network bandwidth/throughput using bwm-ng, I rarely see more than 10MB/s, and often it's around 4MB/s with a 12-node IO-bound job running.
> > > > > > > > >
> > > > > > > > > I originally had the pvfs2 servers connected to the switch with dual gig-e connections and using bonding (ALB) to make them better able to serve multiple nodes. I never saw anywhere close to the throughput I should. In any case, to test whether that was the problem, I removed the bonding and am running through a single gig-e pipe now, but performance hasn't improved at all.
> > > > > > > > >
> > > > > > > > > I'm not sure how to troubleshoot this problem further. Presently, the cluster isn't usable for large I/O jobs, so I really have to fix this.
> > > > > > > > >
> > > > > > > > > --Jim
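
To pull this together, one way to capture the data I'm after: repeat your dd test to the PVFS2 mount from a compute node while each server logs its disk activity, then send along the interesting parts. A sketch (device name, paths, and sizes are only placeholders):

    # on each PVFS2 server, start logging before the test
    iostat -dmx /dev/sdb 5 > /tmp/iostat-$(hostname).log &

    # on a compute node, write through the PVFS2 mount
    dd if=/dev/zero of=/mnt/pvfs2/ddtest bs=1024k count=1000

    # back on each server, stop the logger once the dd finishes
    # (assumes it is the only background job in that shell)
    kill %1

Comparing what the client sees with what the disks and NICs on the servers are doing over the same window (bwm-ng output would cover the network side) should show whether the slowdown is at the disks, the network, or the client.
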
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
