On Wed, Oct 5, 2011 at 11:44 AM, Jim Kusznir <[email protected]> wrote:

> I got some more information today.  A user had run an I/O intensive
> job after the upgrades, and did not experience problems.  However,
> that job is now having problems.  They think it might be due to increased job load,
> but it appears most of the jobs running are doing little I/O except
> theirs, so I'm skeptical.
>
> As to watching the load on the servers, I have tried.  I do watch top
> and occasionally will go down to the servers themselves and watch the
> disk I/O lights.  I've watched network I/O (using bwm-ng).  All of
> these indicate it's loafing around at 10%-15% of capacity.  I don't
> know how to watch actual IOPS or other more direct metrics.
>

Specifically, I was hoping to see iostat output. It will show disk
throughput, disk request size, IOPS, etc.
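
For example, something like this on each server while one of the slow jobs
is running (sdb here is just a placeholder for whatever device backs the
PVFS2 storage space):

iostat -dmx sdb 5

The r/s and w/s columns give IOPS, rMB/s and wMB/s give throughput,
avgrq-sz gives the average request size, and %util shows how busy the
device is.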


> By every measurement I can do, it works fine as long as pvfs is not
> involved.  Once one starts routing traffic through pvfs, performance
> drops to pathetic levels...I don't know how to go any further with
> this.  It used to work better than it does, too.
>
> The problem occurs both on my head node or my compute nodes.  (BTW: my
> head node gets reset/rebooted semi-frequently due to kernel OOPSes
> caused by the pvfs2 module, typically during times of high demand
> through the head node).
>

There have been several fixes for various kernel panics since 2.8.2; one
related to pvfs2 kernel memory being paged out may be relevant to your issue.

Michael

> --Jim
>
> On Tue, Oct 4, 2011 at 9:04 AM, Michael Moore <[email protected]>
> wrote:
> > On Tue, Oct 4, 2011 at 11:43 AM, Jim Kusznir <[email protected]> wrote:
> >>
> >> I didn't try the stripe size; I misinterpreted your suggestion.
> >> Setting that doesn't re-stripe existing data, does it?  I think a lot
> >> of the I/O is reading existing data, then writing out some files.  I
> >> don't think I have a good means of getting hard numbers / strong
> >> evaluation for production loads, as I don't have a viable means of
> >> separating out old from new.  All I have is how long the run took, and I don't
> >> even know where all of their I/O is being done.  There are lots of
> >> different directories (one directory alone took several days to chmod
> >> recursively).
> >>
> >> Is this the type of thing I can set on the root, and it will
> >> recursively fall down?
> >
> > When setting this class of attributes on a directory it applies to files in
> > that directory. It is not recursive and does not affect existing files. For
> > example, the output shown by pvfs2-viewdist won't change on existing files
> > if you set the strip_size on a directory. However, new files should use the
> > new strip_size.
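> >
> > A quick way to confirm that behavior (the paths below are just placeholders):
> >
> > pvfs2-viewdist -f /mnt/pvfs2/existing_file     # unchanged after the setfattr
> > setfattr -n user.pvfs2.dist_name -v simple_stripe /mnt/pvfs2/somedir
> > setfattr -n user.pvfs2.dist_params -v strip_size:262144 /mnt/pvfs2/somedir
> > dd if=/dev/zero of=/mnt/pvfs2/somedir/new_file bs=1M count=100
> > pvfs2-viewdist -f /mnt/pvfs2/somedir/new_file  # should show the new distribution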
> >
> >>
> >> Also, how do I determine the correct stripe size to use?
> >
> > You'll likely need to play around with it. It can be influenced by everything
> > from the storage hardware to the application on a client writing data. I
> > typically focus on the storage side for determining the stripe size, but
> > that's for a 'general purpose' file system. It sounds like you may have
> > specific jobs you want to tune for.
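> >
> > As a rough starting point (the numbers below are made up, substitute your
> > actual RAID geometry): if each RAID6 volume uses a 64k chunk across 10 data
> > disks, a full stripe is 64k * 10 = 640k, so a strip_size of 655360 per server
> > would let one PVFS2 strip map onto one full RAID stripe:
> >
> > setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
> > setfattr -n user.pvfs2.dist_params -v strip_size:655360 <some pvfs2 dir>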
> >
> >>
> >> I'm still concerned about the fact that the performance dropped so
> >> notably after replacing the switch and adding more storage to the
> >> array.  All of what you're pointing to sounds like it should have
> >> always been that bad...
> >>
> >
> > If the iperf tests you ran are representative of the traffic that occurs
> > when you see low performance, then it indicates the network is likely not the
> > issue. However, if it's not representative (if I/O happens from compute nodes,
> > not your head node, for example) you may want to try from several compute
> > nodes to see if congestion dramatically reduces throughput, or whether there
> > are one or two trouble nodes if collective I/O is being done.
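> >
> > A rough way to test that (hostnames from your config, durations arbitrary):
> > run iperf -s on each PVFS2 server, then from several compute nodes at the
> > same time run something like iperf -c pvfs2-io-0-0 -t 30 (and likewise
> > against pvfs2-io-0-1 and pvfs2-io-0-2), and compare the per-node numbers to
> > the single-client results you saw earlier.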
> >
> > Watching disk and network performance on the servers while low-performing I/O
> > is occurring should indicate if the disk or network on one or more of the
> > servers is an issue. If not, then it's time to look at the clients closer.
> >
> > Others may have suggestions too; not meaning to prevent others from throwing
> > out ideas.
> >
> > Michael
> >
> >
> >>
> >> --Jim
> >>
> >> On Mon, Oct 3, 2011 at 11:54 AM, Michael Moore <[email protected]>
> >> wrote:
> >> > On Mon, Oct 3, 2011 at 2:38 PM, Jim Kusznir <[email protected]> wrote:
> >> >>
> >> >> All speeds were in Mbps, the default from iperf.
> >> >>
> >> >> Our files are multi-GB in size, so they do involve all three servers.
> >> >> It applies to all files on the system.
> >> >
> >> > Okay, good, wanted to confirm.
> >> >
> >> >>
> >> >> Can I change the stripe size "on the go"?  I already have about 50TB
> >> >> of data in the system, and have no place big enough to back it up to
> >> >> rebuild the pvfs2 array and restore....
> >> >
> >> > Unfortunately, not that I know of. You can set the extended attributes,
> >> > mentioned previously, on all directories so new files will use a different
> >> > stripe size. Ideally, the strip per server will be equal to some unit that
> >> > your underlying file system and storage digest efficiently, like a RAID
> >> > stripe. Did a larger stripe improve your observed throughput?
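> >> >
> >> > In case it helps, a sketch of one way to apply those attributes to every
> >> > existing directory from a client mount (the path and strip_size here are
> >> > just examples):
> >> >
> >> > find /mnt/pvfs2 -type d -exec setfattr -n user.pvfs2.dist_name -v simple_stripe {} \;
> >> > find /mnt/pvfs2 -type d -exec setfattr -n user.pvfs2.dist_params -v strip_size:262144 {} \;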
> >> >
> >> > Michael
> >> >
> >> >>
> >> >>
> >> >>
> >> >> --Jim
> >> >>
> >> >> On Fri, Sep 30, 2011 at 1:46 PM, Michael Moore <[email protected]> wrote:
> >> >> > See below for specific items. Can you run iostat on the servers while
> >> >> > writing a file that experiences the slow performance? If you could watch
> >> >> > iostat -dmx <device of pvfs storage space> and provide any salient snippets
> >> >> > (high utilization, low utilization, odd looking output, etc) that could
> >> >> > help.
> >> >> >
> >> >> > On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]>
> >> >> > wrote:
> >> >> >>
> >> >> >> 1) iperf (defaults) reported 873, 884, and 929 for connections from
> >> >> >> the three servers to the head node (a pvfs2 client)
> >> >> >
> >> >> > Just to be clear, those are Mbps, right?
> >> >> >
> >> >> >>
> >> >> >> 2) no errors showed up on any of the ports on the managed switch.
> >> >> >
> >> >> > Hmm, if those are Mbps it doesn't seem to be a network layer issue.
> >> >> >>
> >> >> >> 3) I'm not sure what this will do, as the pvfs2 volume is comprised of
> >> >> >> 3 servers, so mounting it on a server still uses the network for the
> >> >> >> other two.  I also don't understand the "single file per datafile"
> >> >> >> statement.  In any case, I do not have the kernel module compiled on
> >> >> >> my servers; they ONLY have the pvfs2 server software installed.
> >> >> >>
> >> >> >
> >> >> > A logical file (e.g. foo.out) in a PVFS2 file system is made up of one
> >> >> > or more datafiles. Based on your config I would assume most are made up of
> >> >> > 3 datafiles with the default stripe size of 64k.
> >> >> >
> >> >> > You can run pvfs2-viewdist -f <file name> to see what the distribution is
> >> >> > and what servers a given file lives on. To see cumulative throughput from
> >> >> > multiple PVFS2 servers the number of datafiles must be greater than one.
> >> >> > Check a couple of the problematic files to see what their distribution is.
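> >> >> >
> >> >> > For example, something like the following (the directory name is just a
> >> >> > placeholder) run against a handful of the slow files would show whether
> >> >> > they are spread across all three servers or landing on just one:
> >> >> >
> >> >> > for f in /mnt/pvfs2/some_job_dir/*; do pvfs2-viewdist -f "$f"; done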
> >> >> >
> >> >> > For a quick test to see if the distribution is impacting performance, set
> >> >> > the following extended attribute on a directory and then check the
> >> >> > performance of writing a file into it:
> >> >> > setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>
> >> >> >
> >> >> > Also, you can test if a larger strip_size would help by doing something
> >> >> > similar to (for a 256k strip):
> >> >> > setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
> >> >> > setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>
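> >> >> >
> >> >> > Then, as a rough check (the size and count are arbitrary), time a write into
> >> >> > that directory from a client and compare it against a directory that still
> >> >> > has the default settings:
> >> >> > dd if=/dev/zero of=<some pvfs2 dir>/strip_test bs=1M count=1024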
> >> >> >
> >> >> >>
> >> >> >> 4) I'm not sure; I used largely defaults.  I've attached my config
> >> >> >> below.
> >> >> >>
> >> >> >> 5) The network bandwidth figure is from one of the servers (the one I
> >> >> >> checked; I believe them to all be similar).
> >> >> >>
> >> >> >> 6) Not sure.  I have created an XFS filesystem using LVM to combine
> >> >> >> the two hardware raid6 volumes and mounted that at /mnt/pvfs2 on the
> >> >> >> servers.  I then let pvfs do its magic.  Config files below.
> >> >> >>
> >> >> >> 7(from second e-mail): Config file attached.
> >> >> >>
> >> >> >> ----------
> >> >> >> /etc/pvfs2-fs.conf:
> >> >> >> ----------
> >> >> >> [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
> >> >> >> <Defaults>
> >> >> >>        UnexpectedRequests 50
> >> >> >>        EventLogging none
> >> >> >>        LogStamp datetime
> >> >> >>        BMIModules bmi_tcp
> >> >> >>        FlowModules flowproto_multiqueue
> >> >> >>        PerfUpdateInterval 1000
> >> >> >>        ServerJobBMITimeoutSecs 30
> >> >> >>        ServerJobFlowTimeoutSecs 30
> >> >> >>        ClientJobBMITimeoutSecs 300
> >> >> >>        ClientJobFlowTimeoutSecs 300
> >> >> >>        ClientRetryLimit 5
> >> >> >>        ClientRetryDelayMilliSecs 2000
> >> >> >>        StorageSpace /mnt/pvfs2
> >> >> >>        LogFile /var/log/pvfs2-server.log
> >> >> >> </Defaults>
> >> >> >>
> >> >> >> <Aliases>
> >> >> >>        Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
> >> >> >>        Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
> >> >> >>        Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
> >> >> >> </Aliases>
> >> >> >>
> >> >> >> <Filesystem>
> >> >> >>        Name pvfs2-fs
> >> >> >>        ID 62659950
> >> >> >>        RootHandle 1048576
> >> >> >>        <MetaHandleRanges>
> >> >> >>                Range pvfs2-io-0-0 4-715827885
> >> >> >>                Range pvfs2-io-0-1 715827886-1431655767
> >> >> >>                Range pvfs2-io-0-2 1431655768-2147483649
> >> >> >>        </MetaHandleRanges>
> >> >> >>        <DataHandleRanges>
> >> >> >>                Range pvfs2-io-0-0 2147483650-2863311531
> >> >> >>                Range pvfs2-io-0-1 2863311532-3579139413
> >> >> >>                Range pvfs2-io-0-2 3579139414-4294967295
> >> >> >>        </DataHandleRanges>
> >> >> >>        <StorageHints>
> >> >> >>                TroveSyncMeta yes
> >> >> >>                TroveSyncData no
> >> >> >>        </StorageHints>
> >> >> >> </Filesystem>
> >> >> >>
> >> >> >>
> >> >> >> ---------------------
> >> >> >> /etc/pvfs2-server.conf-pvfs2-io-0-2
> >> >> >> ---------------------
> >> >> >> StorageSpace /mnt/pvfs2
> >> >> >> HostID "tcp://pvfs2-io-0-2:3334"
> >> >> >> LogFile /var/log/pvfs2-server.log
> >> >> >> ---------------------
> >> >> >>
> >> >> >> All the server config files are very similar.
> >> >> >>
> >> >> >> --Jim
> >> >> >>
> >> >> >>
> >> >> >> On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore
> >> >> >> <[email protected]>
> >> >> >> wrote:
> >> >> >> > No doubt something is awry. Offhand I'm suspecting the network. A couple
> >> >> >> > things that might help give a direction:
> >> >> >> > 1) Do an end-to-end TCP test between client/server. Something like iperf or
> >> >> >> > nuttcp should do the trick.
> >> >> >> > 2) Check server and client ethernet ports on the switch for high error
> >> >> >> > counts (not familiar with that switch, not sure if it's managed or not).
> >> >> >> > Hardware (port/cable) errors should show up in the above test.
> >> >> >> > 3) Can you mount the PVFS2 file system on the server and run some I/O tests
> >> >> >> > (single datafile per file) to see if the network is in fact in play? (See the
> >> >> >> > example mount command after this list.)
> >> >> >> > 4) How many datafiles (by default) is each file you're writing to using? 3?
> >> >> >> > 5) When you watch network bandwidth and see 10 MB/s, where is that? On the
> >> >> >> > server?
> >> >> >> > 6) What backend are you using for I/O, direct or alt-aio? Nothing really
> >> >> >> > wrong either way, just wondering.
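> >> >> >> >
> >> >> >> > For item 3, a sketch of the mount command (the server, port, fs name, and
> >> >> >> > mount point below are placeholders; it needs the pvfs2 kernel module loaded
> >> >> >> > on that server):
> >> >> >> > mount -t pvfs2 tcp://<server>:<port>/<fs name> <mount point>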
> >> >> >> >
> >> >> >> > It sounds like based on the dd output the disks are capable of more than
> >> >> >> > you're seeing, just need to narrow down where the performance is getting
> >> >> >> > squelched.
> >> >> >> >
> >> >> >> > Michael
> >> >> >> >
> >> >> >> > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
> >> >> >> >>
> >> >> >> >> Hi all:
> >> >> >> >>
> >> >> >> >> I've got a pvfs2 install on my cluster.  I never felt it was
> >> >> >> >> performing up to snuff, but lately it seems that things have gone way,
> >> >> >> >> way down in total throughput and overall usability.  To the tune that
> >> >> >> >> jobs writing out 900MB will take an extra 1-2 hours to complete due to
> >> >> >> >> disk I/O waits.  A 2-hr job that would write about 30GB over the
> >> >> >> >> course of the run (normally about 2hrs long) takes up to 20hrs.  Once
> >> >> >> >> the disk I/O is cut out, it completes in 1.5-2hrs.  I've noticed
> >> >> >> >> personally that there's up to a 5 sec lag time when I cd into
> >> >> >> >> /mnt/pvfs2 and do an ls.  Note that all of our operations are using
> >> >> >> >> the kernel module / mount point.  Our problems and code base do not
> >> >> >> >> support the use of other tools (such as the pvfs2-* or the native MPI
> >> >> >> >> libraries); it's all done through the kernel module / filesystem
> >> >> >> >> mountpoint.
> >> >> >> >>
> >> >> >> >> My configuration is this:  3 pvfs2 servers (Dell PowerEdge 1950's with
> >> >> >> >> 1.6Ghz quad-core CPUs, 4GB ram, raid-0 for metadata+os on perc5i
> >> >> >> >> card), Dell Perc6e card with hardware raid6 in two volumes: one on a
> >> >> >> >> bunch of 750GB sata drives, and the other on its second SAS connector
> >> >> >> >> to about 12 2tb WD drives.  The two raid volumes are lvm'ed together
> >> >> >> >> in the OS and mounted as the pvfs2 data store.  Each server is
> >> >> >> >> connected via ethernet to a stack of LG-errison gig-e switches
> >> >> >> >> (stack==2 switches with 40Gbit stacking cables installed).  PVFS 2.8.2
> >> >> >> >> is used throughout the cluster on Rocks (using site-compiled pvfs, not
> >> >> >> >> the rocks-supplied pvfs).  OSes are CentOS5-x-based (both clients and
> >> >> >> >> servers).
> >> >> >> >>
> >> >> >> >> As I said, I always felt something wasn't quite right, but a few
> >> >> >> >> months back, I performed a series of upgrades and reconfigurations on
> >> >> >> >> the infrastructure and hardware.  Specifically, I upgraded to the
> >> >> >> >> lg-errison switches and replaced a full 12-bay drive shelf with a
> >> >> >> >> 24-bay one (moving all the disks over) and added some additional
> >> >> >> >> disks.  All three pvfs2 servers are identical in this.  At some point
> >> >> >> >> prior to these changes, my users were able to get acceptable
> >> >> >> >> performance from pvfs2; now they are not.  I don't have any evidence
> >> >> >> >> pointing to the switch or to the disks.
> >> >> >> >>
> >> >> >> >> I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get
> >> >> >> >> 380+MB/s locally on the pvfs server, writing to the partition on the
> >> >> >> >> hardware raid6 card.  From a compute node, doing that for a 100MB file,
> >> >> >> >> I get 47.7MB/s to my RAID-5 NFS server on the head node, and 36.5MB/s
> >> >> >> >> to my pvfs2 mounted share.  When I watch the network
> >> >> >> >> bandwidth/throughput using bwm-ng, I rarely see more than 10MB/s, and
> >> >> >> >> often it's around 4MB/s with a 12-node IO-bound job running.
> >> >> >> >>
> >> >> >> >> I originally had the pvfs2 servers connected to the switch with dual
> >> >> >> >> gig-e connections and using bonding (ALB) to make it better able to
> >> >> >> >> serve multiple nodes.  I never saw anywhere close to the throughput I
> >> >> >> >> should.  In any case, to test if that was the problem, I removed the
> >> >> >> >> bonding and am running through a single gig-e pipe now, but
> >> >> >> >> performance hasn't improved at all.
> >> >> >> >>
> >> >> >> >> I'm not sure how to troubleshoot this problem further.  Presently, the
> >> >> >> >> cluster isn't usable for large I/O jobs, so I really have to fix
> >> >> >> >> this.
> >> >> >> >>
> >> >> >> >> --Jim
> >> >> >> >
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
