On Mon, Oct 3, 2011 at 2:38 PM, Jim Kusznir <[email protected]> wrote:

> All speeds were in Mbps, the default from iperf.
>
> Our files are multi-GB in size, so they do involve all three servers.
> It applies to all files on the system.
>

Okay, good, wanted to confirm.


>
> Can I change the stripe size "on the fly"?  I already have about 50TB
> of data in the system, and have no place big enough to back it up to
> rebuild the pvfs2 array and restore....
>

Unfortunately, not that I know of. You can set the extended attributes
mentioned previously on all directories so that new files will use a different
stripe size. Ideally, the strip size per server will equal some unit that your
underlying file system and storage digest efficiently, such as a full RAID
stripe. Did a larger stripe improve your observed throughput?
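
For example, a rough sketch of applying the hints to every existing directory
so that new files pick up a larger strip (the mount point and the 256k value
are placeholders to adjust for your setup):

# apply a larger strip size to all existing directories (affects new files only)
find /mnt/pvfs2 -type d -exec setfattr -n user.pvfs2.dist_name -v simple_stripe {} \;
find /mnt/pvfs2 -type d -exec setfattr -n user.pvfs2.dist_params -v strip_size:262144 {} \;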

Michael


>
>
> --Jim
>
> On Fri, Sep 30, 2011 at 1:46 PM, Michael Moore <[email protected]> wrote:
> > See below for specific items. Can you run iostat on the servers while
> > writing a file that experiences the slow performance? If you could watch
> > iostat -dmx <device of pvfs storage space> and provide any salient snippets
> > (high utilization, low utilization, odd-looking output, etc.) that could
> > help.
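> >
> > A minimal sketch of what I mean (sdb here is just a stand-in for whatever
> > device backs the storage space):
> >
> > # extended per-device stats in MB/s, refreshed every 5 seconds
> > iostat -dmx sdb 5
> >
> > High %util with low MB/s during the slow writes would point at the disks;
> > mostly idle devices would point back at the network or the PVFS2 layer.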
> >
> > On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]> wrote:
> >>
> >> 1) iperf (defaults) reported 873, 884, and 929 for connections from
> >> the three servers to the head node (a pvfs2 client)
> >
> > Just to be clear, those are Mbps, right?
> >
> >>
> >> 2) no errors showed up on any of the ports on the managed switch.
> >
> > Hmm, if those are Mbps it doesn't seem to be a network-layer problem.
> >>
> >> 3) I'm not sure what this will do, as the pvfs2 volume is comprised of
> >> 3 servers, so mounting it on a server still uses the network for the
> >> other two.  I also don't understand the "single datafile per file"
> >> statement.  In any case, I do not have the kernel module compiled on
> >> my servers; they ONLY have the pvfs2 server software installed.
> >>
> >
> > A logical file (e.g. foo.out) in a PVFS2 file system is made up of one or
> > more datafiles. Based on your config I would assume most are made up of 3
> > datafiles with the default stripe size of 64k.
> >
> > You can run pvfs2-viewdist -f <file name> to see what the distribution is
> > and what servers a given file lives on. To see cumulative throughput from
> > multiple PVFS2 servers, the number of datafiles must be greater than one.
> > Check a couple of the problematic files to see what their distribution is.
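> >
> > For instance (the path here is only a hypothetical example):
> > pvfs2-viewdist -f /mnt/pvfs2/some_large_output.dat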
> >
> > For a quick test to see if the distribution is impacting performance, set
> > the following extended attribute on a directory and then check the
> > performance of writing a file into it:
> > setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>
> >
> > Also, you can test if a larger strip_size would help by doing something
> > similar to the following (for a 256k strip):
> > setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
> > setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>
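> >
> > To double-check that the hints took, I believe getfattr should echo them
> > back:
> > getfattr -n user.pvfs2.dist_params <some pvfs2 dir>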
> >
> >>
> >> 4) I'm not sure; I used largely defaults.  I've attached my config below.
> >>
> >> 5) The network bandwidth measurement was on one of the servers (the one I
> >> checked; I believe them all to be similar).
> >>
> >> 6) Not sure.  I have created an XFS filesystem using LVM to combine
> >> the two hardware raid6 volumes and mounted that at /mnt/pvfs2 on the
> >> servers.  I then let pvfs do its magic.  Config files below.
> >>
> >> 7(from second e-mail): Config file attached.
> >>
> >> ----------
> >> /etc/pvfs2-fs.conf:
> >> ----------
> >> [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
> >> <Defaults>
> >>        UnexpectedRequests 50
> >>        EventLogging none
> >>        LogStamp datetime
> >>        BMIModules bmi_tcp
> >>        FlowModules flowproto_multiqueue
> >>        PerfUpdateInterval 1000
> >>        ServerJobBMITimeoutSecs 30
> >>        ServerJobFlowTimeoutSecs 30
> >>        ClientJobBMITimeoutSecs 300
> >>        ClientJobFlowTimeoutSecs 300
> >>        ClientRetryLimit 5
> >>        ClientRetryDelayMilliSecs 2000
> >>        StorageSpace /mnt/pvfs2
> >>        LogFile /var/log/pvfs2-server.log
> >> </Defaults>
> >>
> >> <Aliases>
> >>        Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
> >>        Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
> >>        Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
> >> </Aliases>
> >>
> >> <Filesystem>
> >>        Name pvfs2-fs
> >>        ID 62659950
> >>        RootHandle 1048576
> >>        <MetaHandleRanges>
> >>                Range pvfs2-io-0-0 4-715827885
> >>                Range pvfs2-io-0-1 715827886-1431655767
> >>                Range pvfs2-io-0-2 1431655768-2147483649
> >>        </MetaHandleRanges>
> >>        <DataHandleRanges>
> >>                Range pvfs2-io-0-0 2147483650-2863311531
> >>                Range pvfs2-io-0-1 2863311532-3579139413
> >>                Range pvfs2-io-0-2 3579139414-4294967295
> >>        </DataHandleRanges>
> >>        <StorageHints>
> >>                TroveSyncMeta yes
> >>                TroveSyncData no
> >>        </StorageHints>
> >> </Filesystem>
> >>
> >>
> >> ---------------------
> >> /etc/pvfs2-server.conf-pvfs2-io-0-2
> >> ---------------------
> >> StorageSpace /mnt/pvfs2
> >> HostID "tcp://pvfs2-io-0-2:3334"
> >> LogFile /var/log/pvfs2-server.log
> >> ---------------------
> >>
> >> All the server config files are very similar.
> >>
> >> --Jim
> >>
> >>
> >> On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore <[email protected]>
> >> wrote:
> >> > No doubt something is awry. Offhand I'm suspecting the network. A couple
> >> > of things that might help give a direction:
> >> > 1) Do an end-to-end TCP test between client and server. Something like
> >> > iperf or nuttcp should do the trick (see the sketch after this list).
> >> > 2) Check server and client ethernet ports on the switch for high error
> >> > counts (not familiar with that switch, not sure if it's managed or not).
> >> > Hardware (port/cable) errors should show up in the above test.
> >> > 3) Can you mount the PVFS2 file system on the server and run some I/O
> >> > tests (single datafile per file) to see if the network is in fact in play?
> >> > 4) How many datafiles (by default) is each file you're writing to using?
> >> > 3?
> >> > 5) When you watch network bandwidth and see 10 MB/s, where is that? On
> >> > the server?
> >> > 6) What backend are you using for I/O, direct or alt-aio? Nothing really
> >> > wrong either way, just wondering.
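> >> >
> >> > For 1), a minimal sketch (the host name is only a placeholder for one of
> >> > your pvfs2 servers):
> >> > # on the server
> >> > iperf -s
> >> > # on the head node or a compute node
> >> > iperf -c pvfs2-io-0-0 -t 30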
> >> >
> >> > It sounds like, based on the dd output, the disks are capable of more
> >> > than you're seeing; we just need to narrow down where the performance is
> >> > getting squelched.
> >> >
> >> > Michael
> >> >
> >> > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
> >> >>
> >> >> Hi all:
> >> >>
> >> >> I've got a pvfs2 install on my cluster.  I never felt it was performing
> >> >> up to snuff, but lately it seems that things have gone way, way down in
> >> >> total throughput and overall usability, to the point that jobs writing
> >> >> out 900MB will take an extra 1-2 hours to complete due to disk I/O
> >> >> waits.  A 2-hr job that would write about 30GB over the course of the
> >> >> run (normally about 2hrs long) takes up to 20hrs.  Once the disk I/O is
> >> >> cut out, it completes in 1.5-2hrs.  I've noticed personally that there's
> >> >> up to a 5 sec lag time when I cd into /mnt/pvfs2 and do an ls.  Note
> >> >> that all of our operations are using the kernel module / mount point.
> >> >> Our problems and code base do not support the use of other tools (such
> >> >> as the pvfs2-* utilities or the native MPI libraries); it's all done
> >> >> through the kernel module / filesystem mountpoint.
> >> >>
> >> >> My configuration is this:  3 pvfs2 servers (Dell PowerEdge 1950's with
> >> >> 1.6GHz quad-core CPUs, 4GB ram, raid-0 for metadata+os on a perc5i
> >> >> card), each with a Dell Perc6e card with hardware raid6 in two volumes:
> >> >> one on a bunch of 750GB sata drives, and the other on its second SAS
> >> >> connector to about 12 2TB WD drives.  The two raid volumes are lvm'ed
> >> >> together in the OS and mounted as the pvfs2 data store.  Each server is
> >> >> connected via ethernet to a stack of LG-Ericsson gig-e switches
> >> >> (stack==2 switches with 40Gbit stacking cables installed).  PVFS 2.8.2
> >> >> is used throughout the cluster on Rocks (using site-compiled pvfs, not
> >> >> the rocks-supplied pvfs).  OSes are CentOS5-x-based (both clients and
> >> >> servers).
> >> >>
> >> >> As I said, I always felt something wasn't quite right, but a few
> >> >> months back, I performed a series of upgrades and reconfigurations on
> >> >> the infrastructure and hardware.  Specifically, I upgraded to the
> >> >> LG-Ericsson switches, replaced a full 12-bay drive shelf with a
> >> >> 24-bay one (moving all the disks through), and added some additional
> >> >> disks.  All three pvfs2 servers are identical in this.  At some point
> >> >> prior to these changes, my users were able to get acceptable
> >> >> performance from pvfs2; now they are not.  I don't have any evidence
> >> >> pointing to the switch or to the disks.
> >> >>
> >> >> I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get
> >> >> 380+MB/s locally on the pvfs server, writing to the partition on the
> >> >> hardware raid6 card.  From a compute node, doing that for a 100MB file,
> >> >> I get 47.7MB/s to my RAID-5 NFS server on the head node, and 36.5MB/s
> >> >> to my pvfs2 mounted share.  When I watch the network
> >> >> bandwidth/throughput using bwm-ng, I rarely see more than 10MB/s, and
> >> >> often it's around 4MB/s with a 12-node IO-bound job running.
> >> >>
> >> >> I originally had the pvfs2 servers connected to the switch with dual
> >> >> gig-e connections and using bonding (ALB) to make it better able to
> >> >> serve multiple nodes.  I never saw anywhere close to the throughput I
> >> >> should.  In any case, to test if that was the problem, I removed the
> >> >> bonding and am running through a single gig-e pipe now, but
> >> >> performance hasn't improved at all.
> >> >>
> >> >> I'm not sure how to troubleshoot this problem further.  Presently, the
> >> >> cluster isn't usable for large I/O jobs, so I really have to fix this.
> >> >>
> >> >> --Jim
> >> >
> >> >
> >
> >
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
