See below for specific items. Can you run iostat on the servers while
writing a file that experiences the slow performance? If you could watch
iostat -dmx <device of pvfs storage space> and provide any salient snippets
(high utilization, low utilization, odd-looking output, etc.), that would
help.
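
For example, something like this on each server while a client writes (the
device name /dev/sdb and the file sizes are just placeholders; substitute
whatever device backs /mnt/pvfs2):

# on the server, sample the storage device every 2 seconds
iostat -dmx /dev/sdb 2

# meanwhile, from a client, write into the mounted file system
dd if=/dev/zero of=/mnt/pvfs2/iostat-test bs=1024k count=1000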

On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]> wrote:

> 1) iperf (defaults) reported 873, 884, and 929 for connections from
> the three servers to the head node (a pvfs2 client)
>

Just to be clear, those are Mbps, right?


>
> 2) no errors showed up on any of the ports on the managed switch.
>

Hmm, if those are Mbps, it doesn't look like a network-layer problem.
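
As a rough sanity check: 873-929 Mbps divided by 8 is roughly 109-116 MB/s
of raw TCP throughput, far above the 4-10 MB/s you're seeing with bwm-ng,
so the link itself has plenty of headroom.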

>
> 3) I'm not sure what this will do, as the pvfs2 volume is composed of
> 3 servers, so mounting it on a server still uses the network for the
> other two.  I also don't understand the "single datafile per file"
> statement.  In any case, I do not have the kernel module compiled on
> my servers; they ONLY have the pvfs2 server software installed.
>
>
A logical file (e.g., foo.out) in a PVFS2 file system is made up of one or
more datafiles. Based on your config, I would assume most files are made up
of 3 datafiles with the default stripe size of 64k.
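
To make that concrete: with 3 datafiles and a 64k strip, bytes 0-64k of
foo.out go to the first server, 64k-128k to the second, 128k-192k to the
third, and then the pattern wraps back to the first. Large sequential
writes should therefore spread across all three servers, while small writes
tend to hit one server at a time.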

You can run pvfs2-viewdist -f <file name> to see the distribution and which
servers a given file lives on. To see cumulative throughput from multiple
PVFS2 servers, the number of datafiles must be greater than one. Check a
couple of the problematic files to see what their distribution is.
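
For example (the path here is hypothetical; point it at one of the files
that writes slowly):

pvfs2-viewdist -f /mnt/pvfs2/slow-job/output.dat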

For a quick test to see if the distribution is impacting performance, set
the following extended attribute on a directory and then check the
performance of writing a file into it:
setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>

Also, you can test whether a larger strip_size would help by doing
something similar to the following (for a 256k strip):
setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>
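
Putting those together, a quick end-to-end timing test might look like the
following (the directory name and sizes are just examples; this assumes the
attributes are set through the kernel mount point before any files are
created in the directory, since the distribution is fixed at file-create
time):

mkdir /mnt/pvfs2/stripe-test
setfattr -n user.pvfs2.num_dfiles -v "3" /mnt/pvfs2/stripe-test
setfattr -n user.pvfs2.dist_name -v simple_stripe /mnt/pvfs2/stripe-test
setfattr -n user.pvfs2.dist_params -v strip_size:262144 /mnt/pvfs2/stripe-test

# time a large sequential write into the tuned directory
dd if=/dev/zero of=/mnt/pvfs2/stripe-test/testfile bs=1024k count=1000

# then confirm the layout the new file actually got
pvfs2-viewdist -f /mnt/pvfs2/stripe-test/testfile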


> 4) I'm not sure; I used largely defaults.  I've attached my config below.
>

> 5) the network bandwidth measurement was on one of the servers (the one
> I checked; I believe them all to be similar).
>
> 6) Not sure.  I have created an XFS filesystem using LVM to combine
> the two hardware raid6 volumes and mounted that at /mnt/pvfs2 on the
> servers.  I then let pvfs do its magic.  Config files below.
>

> 7(from second e-mail): Config file attached.
>
> ----------
> /etc/pvfs2-fs.conf:
> ----------
> [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
> <Defaults>
>        UnexpectedRequests 50
>        EventLogging none
>        LogStamp datetime
>        BMIModules bmi_tcp
>        FlowModules flowproto_multiqueue
>        PerfUpdateInterval 1000
>        ServerJobBMITimeoutSecs 30
>        ServerJobFlowTimeoutSecs 30
>        ClientJobBMITimeoutSecs 300
>        ClientJobFlowTimeoutSecs 300
>        ClientRetryLimit 5
>        ClientRetryDelayMilliSecs 2000
>        StorageSpace /mnt/pvfs2
>        LogFile /var/log/pvfs2-server.log
> </Defaults>
>
> <Aliases>
>        Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
>        Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
>        Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
> </Aliases>
>
> <Filesystem>
>        Name pvfs2-fs
>        ID 62659950
>        RootHandle 1048576
>        <MetaHandleRanges>
>                Range pvfs2-io-0-0 4-715827885
>                Range pvfs2-io-0-1 715827886-1431655767
>                Range pvfs2-io-0-2 1431655768-2147483649
>        </MetaHandleRanges>
>        <DataHandleRanges>
>                Range pvfs2-io-0-0 2147483650-2863311531
>                Range pvfs2-io-0-1 2863311532-3579139413
>                Range pvfs2-io-0-2 3579139414-4294967295
>        </DataHandleRanges>
>        <StorageHints>
>                TroveSyncMeta yes
>                TroveSyncData no
>        </StorageHints>
> </Filesystem>
>
>
> ---------------------
> /etc/pvfs2-server.conf-pvfs2-io-0-2
> ---------------------
> StorageSpace /mnt/pvfs2
> HostID "tcp://pvfs2-io-0-2:3334"
> LogFile /var/log/pvfs2-server.log
> ---------------------
>
> All the server config files are very similar.
>
> --Jim
>
>
> On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore <[email protected]>
> wrote:
> > No doubt something is awry. Offhand I'm suspecting the network. A couple
> > things that might help give a direction:
> > 1) Do an end-to-end TCP test between client/server. Something like
> > iperf or nuttcp should do the trick.
> > 2) Check server and client ethernet ports on the switch for high error
> > counts (not familiar with that switch, not sure if it's managed or not).
> > Hardware (port/cable) errors should show up in the above test.
> > 3) Can you mount the PVFS2 file system on the server and run some I/O
> > tests (single datafile per file) to see if the network is in fact in
> > play?
> > 4) How many datafiles (by default) is each file you're writing to
> > using? 3?
> > 5) When you watch network bandwidth and see 10 MB/s, where is that? On
> > the server?
> > 6) What backend are you using for I/O, direct or alt-aio? Nothing
> > really wrong either way, just wondering.
> >
> > It sounds like, based on the dd output, the disks are capable of more
> > than you're seeing; we just need to narrow down where the performance
> > is getting squelched.
> >
> > Michael
> >
> > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
> >>
> >> Hi all:
> >>
> >> I've got a pvfs2 install on my cluster.  I never felt it was
> >> performing up to snuff, but lately it seems that things have gone way,
> >> way down in total throughput and overall usability.  To the tune that
> >> jobs writing out 900MB will take an extra 1-2 hours to complete due to
> >> disk I/O waits.  A job that writes about 30GB over the course of a
> >> normally 2-hr run now takes up to 20hrs.  Once
> >> the disk I/O is cut out, it completes in 1.5-2hrs.  I've noticed
> >> personally that there's up to a 5 sec lag time when I cd into
> >> /mnt/pvfs2 and do an ls.  Note that all of our operations are using
> >> the kernel module / mount point.  Our problems and code base do not
> >> support the use of other tools (such as the pvfs2-* or the native MPI
> >> libraries); it's all done through the kernel module / filesystem
> >> mountpoint.
> >>
> >> My configuration is this:  3 pvfs2 servers (Dell PowerEdge 1950's with
> >> 1.6Ghz quad-core CPUs, 4GB ram, raid-0 for metadata+os on perc5i
> >> card), Dell Perc6e card with hardware raid6 in two volumes: one on a
> >> bunch of 750GB sata drives, and the other on its second SAS connector
> >> to about 12 2tb WD drives.  The two raid volumes are lvm'ed together
> >> in the OS and mounted as the pvfs2 data store.  Each server is
> >> connected via ethernet to a stack of LG-Ericsson gig-e switches
> >> (stack==2 switches with 40Gbit stacking cables installed).  PVFS 2.8.2
> >> used throughout the cluster on Rocks (using site-compiled pvfs, not
> >> the rocks-supplied pvfs).  OSes are CentOS5-x-based (both clients and
> >> servers).
> >>
> >> As I said, I always felt something wasn't quite right, but a few
> >> months back, I performed a series of upgrades and reconfigurations on
> >> the infrastructure and hardware.  Specifically, I upgraded to the
> >> LG-Ericsson switches and replaced a full 12-bay drive shelf with a
> >> 24-bay one (moving all the disks through) and added some additional
> >> disks.  All three pvfs2 servers are identical in this.  At some point
> >> prior to these changes, my users were able to get acceptable
> >> performance from pvfs2; now they are not.  I don't have any evidence
> >> pointing to the switch or to the disks.
> >>
> >> I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get
> >> 380+MB/s locally on the pvfs server, writing to the partition on the
> >> hardware raid6 card.  From a compute node, doing that for a 100MB file,
> >> I get 47.7MB/s to my RAID-5 NFS server on the head node, and 36.5MB/s
> >> to my pvfs2 mounted share.  When I watch the network
> >> bandwidth/throughput using bwm-ng, I rarely see more than 10MB/s, and
> >> often it's around 4MB/s with a 12-node IO-bound job running.
> >>
> >> I originally had the pvfs2 servers connected to the switch with dual
> >> gig-e connections and using bonding (ALB) to make it more able to
> >> serve multiple nodes.  I never saw anywhere close to the throughput I
> >> should.  In any case, to test if that was the problem, I removed the
> >> bonding and am running through a single gig-e pipe now, but
> >> performance hasn't improved at all.
> >>
> >> I'm not sure how to troubleshoot this problem further.  Presently, the
> >> cluster isn't usable for large I/O jobs, so I really have to fix this.
> >>
> >> --Jim
> >
> >
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
