All speeds were in Mbps, the default from iperf.
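(For reference, the invocation was essentially the stock one: iperf -s on
the head node and iperf -c <head node> from each pvfs2 server, all
defaults.)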

Our files are multi-GB in size, so they do involve all three servers.
The slowness applies to all files on the system.

Can I change the stripe size "on the fly"?  I already have about 50TB
of data in the system, and no place big enough to back it all up so I
could rebuild the pvfs2 array and restore it....

--Jim

On Fri, Sep 30, 2011 at 1:46 PM, Michael Moore <[email protected]> wrote:
> See below for specific items. Can you run iostat on the servers while
> writing a file that experiences the slow performance? If you could watch
> iostat -dmx <device of pvfs storage space> and provide any salient
> snippets (high utilization, low utilization, odd-looking output, etc.),
> that could help.
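>
> As a quick sketch (the device name here is a placeholder; use whatever
> backs /mnt/pvfs2):
> iostat -dmx /dev/sdb 5
> That prints extended per-device stats in MB every 5 seconds; watch the
> %util and await columns while a slow write is running.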
>
> On Thu, Sep 29, 2011 at 11:42 AM, Jim Kusznir <[email protected]> wrote:
>>
>> 1) iperf (defaults) reported 873, 884, and 929 for connections from
>> the three servers to the head node (a pvfs2 client)
>
> Just to be clear, those are Mbps, right?
>
>>
>> 2) no errors showed up on any of the ports on the managed switch.
>
> Hmm, if those are Mbps then this doesn't seem to be a network-layer problem.
>>
>> 3) I'm not sure what this will tell us, as the pvfs2 volume spans 3
>> servers, so mounting it on one server still uses the network for the
>> other two.  I also don't understand the "single datafile per file"
>> statement.  In any case, I do not have the kernel module compiled on
>> my servers; they ONLY have the pvfs2 server software installed.
>>
>
> A logical file (e.g. foo.out) in a PVFS2 file system is made up of one
> or more datafiles. Based on your config, I would assume most files are
> made up of 3 datafiles with the default stripe size of 64k.
>
> You can run pvfs2-viewdist -f <file name> to see what the distribution
> is and which servers a given file lives on. To see cumulative throughput
> from multiple PVFS2 servers, the number of datafiles must be greater
> than one. Check a couple of the problematic files to see what their
> distribution is.
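>
> For example (the path here is hypothetical):
> pvfs2-viewdist -f /mnt/pvfs2/path/to/slow-file.out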
>
> For a quick test to see if the distribution is impacting performance,
> set the following extended attribute on a directory and then check the
> performance of writing a file into it:
> setfattr -n user.pvfs2.num_dfiles -v "3" <some pvfs2 dir>
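>
> You should be able to confirm it took with the matching getfattr call
> (assuming the standard attr tools are installed):
> getfattr -n user.pvfs2.num_dfiles <some pvfs2 dir>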
>
> Also, you can test whether a larger strip_size would help by doing
> something similar to the following (for a 256k strip):
> setfattr -n user.pvfs2.dist_name -v simple_stripe <some pvfs2 dir>
> setfattr -n user.pvfs2.dist_params -v strip_size:262144 <some pvfs2 dir>
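>
> The distribution hints only affect newly created files, so write a
> fresh file into that directory and re-run your timing, e.g. with the
> same dd test you used before:
> dd if=/dev/zero of=<some pvfs2 dir>/testfile bs=1024k count=1000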
>
>>
>> 4) I'm not sure; I largely used the defaults.  I've attached my config below.
>>
>> 5) The network bandwidth figure is from one of the servers (the one I
>> checked; I believe them all to be similar).
>>
>> 6) Not sure.  I created an XFS filesystem using LVM to combine the two
>> hardware raid6 volumes and mounted it at /mnt/pvfs2 on the servers.  I
>> then let pvfs do its magic.  Config files below.
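>>
>> Roughly, the layout was (a sketch; the device and volume names here
>> are illustrative, not the exact ones on my servers):
>> pvcreate /dev/sdb /dev/sdc
>> vgcreate vg_pvfs /dev/sdb /dev/sdc
>> lvcreate -l 100%FREE -n lv_pvfs vg_pvfs
>> mkfs.xfs /dev/vg_pvfs/lv_pvfs
>> mount /dev/vg_pvfs/lv_pvfs /mnt/pvfs2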
>>
>> 7) (From the second e-mail): Config file attached.
>>
>> ----------
>> /etc/pvfs2-fs.conf:
>> ----------
>> [root@pvfs2-io-0-2 mnt]# cat /etc/pvfs2-fs.conf
>> <Defaults>
>>        UnexpectedRequests 50
>>        EventLogging none
>>        LogStamp datetime
>>        BMIModules bmi_tcp
>>        FlowModules flowproto_multiqueue
>>        PerfUpdateInterval 1000
>>        ServerJobBMITimeoutSecs 30
>>        ServerJobFlowTimeoutSecs 30
>>        ClientJobBMITimeoutSecs 300
>>        ClientJobFlowTimeoutSecs 300
>>        ClientRetryLimit 5
>>        ClientRetryDelayMilliSecs 2000
>>        StorageSpace /mnt/pvfs2
>>        LogFile /var/log/pvfs2-server.log
>> </Defaults>
>>
>> <Aliases>
>>        Alias pvfs2-io-0-0 tcp://pvfs2-io-0-0:3334
>>        Alias pvfs2-io-0-1 tcp://pvfs2-io-0-1:3334
>>        Alias pvfs2-io-0-2 tcp://pvfs2-io-0-2:3334
>> </Aliases>
>>
>> <Filesystem>
>>        Name pvfs2-fs
>>        ID 62659950
>>        RootHandle 1048576
>>        <MetaHandleRanges>
>>                Range pvfs2-io-0-0 4-715827885
>>                Range pvfs2-io-0-1 715827886-1431655767
>>                Range pvfs2-io-0-2 1431655768-2147483649
>>        </MetaHandleRanges>
>>        <DataHandleRanges>
>>                Range pvfs2-io-0-0 2147483650-2863311531
>>                Range pvfs2-io-0-1 2863311532-3579139413
>>                Range pvfs2-io-0-2 3579139414-4294967295
>>        </DataHandleRanges>
>>        <StorageHints>
>>                TroveSyncMeta yes
>>                TroveSyncData no
>>        </StorageHints>
>> </Filesystem>
>>
>>
>> ---------------------
>> /etc/pvfs2-server.conf-pvfs2-io-0-2
>> ---------------------
>> StorageSpace /mnt/pvfs2
>> HostID "tcp://pvfs2-io-0-2:3334"
>> LogFile /var/log/pvfs2-server.log
>> ---------------------
>>
>> All the server config files are very similar.
>>
>> --Jim
>>
>>
>> On Wed, Sep 28, 2011 at 4:45 PM, Michael Moore <[email protected]>
>> wrote:
>> > No doubt something is awry. Offhand I'm suspecting the network. A couple
>> > things that might help give a direction:
>> > 1) Do an end-to-end TCP test between client/server. Something like iperf
>> > or
>> > nuttcp should do the trick.
>> > 2) Check server and client ethernet ports on the switch for high error
>> > counts (not familiar with that switch, not sure if it's managed or not).
>> > Hardware (port/cable) errors should show up in the above test.
>> > 3) Can you mount the PVFS2 file system on the server and run some I/O
>> > tests (single datafile per file) to see if the network is in fact in
>> > play? (See the sketch after this list.)
>> > 4) How many datafiles (by default) is each file you're writing to
>> > using? 3?
>> > 5) When you watch network bandwidth and see 10 MB/s, where is that? On
>> > the server?
>> > 6) What backend are you using for I/O, direct or alt-aio? Nothing
>> > really wrong either way, just wondering.
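>> >
>> > For 3, a minimal sketch (assuming a fs name of pvfs2-fs on port 3334;
>> > adjust to your setup, and the mount point is illustrative):
>> > mount -t pvfs2 tcp://localhost:3334/pvfs2-fs /mnt/pvfs2-local
>> > setfattr -n user.pvfs2.num_dfiles -v "1" /mnt/pvfs2-local/testdir
>> > New files written into that directory then have a single datafile, so
>> > the data path is limited to one server.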
>> >
>> > It sounds like, based on the dd output, the disks are capable of more
>> > than you're seeing; we just need to narrow down where the performance
>> > is getting squelched.
>> >
>> > Michael
>> >
>> > On Wed, Sep 28, 2011 at 6:10 PM, Jim Kusznir <[email protected]> wrote:
>> >>
>> >> Hi all:
>> >>
>> >> I've got a pvfs2 install on my cluster.  I never felt it was
>> >> performing up to snuff, but lately total throughput and overall
>> >> usability seem to have gone way, way down, to the point that jobs
>> >> writing out 900MB take an extra 1-2 hours to complete due to disk
>> >> I/O waits.  A job that normally runs about 2hrs and writes about
>> >> 30GB over the course of the run takes up to 20hrs; once the disk
>> >> I/O is cut out, it completes in 1.5-2hrs.  I've also personally
>> >> noticed up to a 5 sec lag when I cd into /mnt/pvfs2 and do an ls.
>> >> Note that all of our operations use the kernel module / mount
>> >> point.  Our problems and code base do not support the use of other
>> >> tools (such as the pvfs2-* utilities or the native MPI libraries);
>> >> it's all done through the kernel module / filesystem mountpoint.
>> >>
>> >> My configuration is this: 3 pvfs2 servers (Dell PowerEdge 1950's
>> >> with 1.6GHz quad-core CPUs, 4GB RAM, raid-0 for metadata+OS on a
>> >> Perc5i card), each with a Dell Perc6e card running hardware raid6
>> >> in two volumes: one on a bunch of 750GB SATA drives, and the
>> >> other, on its second SAS connector, on about 12 2TB WD drives.
>> >> The two raid volumes are lvm'ed together in the OS and mounted as
>> >> the pvfs2 data store.  Each server is connected via ethernet to a
>> >> stack of LG-Ericsson gig-e switches (stack == 2 switches with
>> >> 40Gbit stacking cables installed).  PVFS 2.8.2 is used throughout
>> >> the cluster on Rocks (using site-compiled pvfs, not the
>> >> rocks-supplied pvfs).  OSes are CentOS5-x-based (both clients and
>> >> servers).
>> >>
>> >> As I said, I always felt something wasn't quite right, but a few
>> >> months back I performed a series of upgrades and reconfigurations
>> >> on the infrastructure and hardware.  Specifically, I upgraded to
>> >> the LG-Ericsson switches and replaced a full 12-bay drive shelf
>> >> with a 24-bay one (moving all the disks over) and added some
>> >> additional disks.  All three pvfs2 servers are identical in this.
>> >> At some point prior to these changes, my users were able to get
>> >> acceptable performance from pvfs2; now they are not.  I don't have
>> >> any evidence pointing to either the switch or the disks.
>> >>
>> >> I can run dd if=/dev/zero of=testfile bs=1024k count=10000 and get
>> >> 380+MB/s locally on the pvfs server, writing to the partition on
>> >> the hardware raid6 card.  From a compute node, doing the same with
>> >> a 100MB file, I get 47.7MB/s to my RAID-5 NFS server on the head
>> >> node and 36.5MB/s to my pvfs2-mounted share.  When I watch the
>> >> network bandwidth/throughput using bwm-ng, I rarely see more than
>> >> 10MB/s, and often it's around 4MB/s with a 12-node I/O-bound job
>> >> running.
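>> >>
>> >> (The compute-node numbers came from the same dd, scaled down, e.g.:
>> >> dd if=/dev/zero of=/mnt/pvfs2/testfile bs=1024k count=100
>> >> and the equivalent against the NFS mount.)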
>> >>
>> >> I originally had the pvfs2 servers connected to the switch with
>> >> dual gig-e connections, using bonding (ALB) to make them better
>> >> able to serve multiple nodes.  I never saw anywhere close to the
>> >> throughput I should have.  In any case, to test if that was the
>> >> problem, I removed the bonding and am running through a single
>> >> gig-e pipe now, but performance hasn't improved at all.
>> >>
>> >> I'm not sure how to troubleshoot this problem further.  Presently, the
>> >> cluster isn't usable for large I/O jobs, so I really have to fix this.
>> >>
>> >> --Jim
>> >
>> >
>
>
