Hi Jim,

Sorry for the delay in responding. Based on what you described, it seems
the problem is a combination of small writes from the application and the
number of IOPS you're getting out of the storage. From the snippet below,
the average request size (in 512-byte sectors) is 68, or about 34 KB.
Modifying the application to do larger writes would be one option. There
are also some tuning changes you can play around with that may help get
more I/O to the disks on the server side.
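
To spell out the arithmetic from the dm-0 line in the iostat output you
posted: avgrq-sz is reported in 512-byte sectors, so

    68.38 sectors x 512 bytes/sector  ~ 35,000 bytes ~ 34 KB per request
    96.51 r/s + 16.21 w/s             ~ 113 requests/s
    113 requests/s x 34 KB            ~ 3.8 MB/s

which matches the ~3.7 MB/s of reads plus 0.04 MB/s of writes you're seeing.
The array is busy servicing a large number of small requests rather than
moving much data.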

In addition to the mkfs.xfs options that have been mentioned, there are
entries under /sys/block/ for both block devices and device-mapper devices
that might help. The three I would start with are:
/sys/block/<dev>/queue/max_hw_sectors_kb
/sys/block/<dev>/queue/nr_requests
/sys/block/<dev>/queue/scheduler
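
As a rough sketch of how to inspect these (the device name is just an
example; substitute whichever sdX or dm-N your PVFS storage actually sits
on):

    cat /sys/block/sdb/queue/max_hw_sectors_kb   # hardware ceiling, read-only
    cat /sys/block/sdb/queue/max_sectors_kb      # writable per-request size cap
    cat /sys/block/sdb/queue/nr_requests         # request queue depth
    cat /sys/block/sdb/queue/scheduler           # active scheduler shown in [brackets]

If I remember right, max_hw_sectors_kb itself is read-only; the value you
actually write is max_sectors_kb, which is bounded by the hardware limit.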

For any multipath or device-mapper devices you can change those values too
(although I think nr_requests is fixed at 128 on dm devices). This URL may be of
some help: http://people.redhat.com/msnitzer/docs/io-limits.txt. There is
other documentation in the kernel source Documentation/ directory if you
really want to dig in.
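
For example, to see how one of the dm devices is put together and what its
queue settings are (assuming dm-0 is the LV holding your XFS filesystem, as
in the iostat output below):

    ls /sys/block/dm-0/slaves/              # physical devices underneath the LV
    cat /sys/block/dm-0/queue/nr_requests
    cat /sys/block/dm-0/queue/scheduler     # may just report "none"; scheduling
                                            # happens on the underlying disks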

In testing here, bumping nr_requests up to 512 (from 128) increased overall
throughput for streaming workloads; I'm not sure what you'll see for small
and/or random I/O. Also, setting max_sectors_kb (which is capped by
max_hw_sectors_kb) to the stripe size of the underlying RAID can help for
larger I/O, but again may not be helpful in this situation. Playing with the
various I/O schedulers might help as well; deadline is a popular choice.
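
As a sketch of the sort of thing I tried here (the device name and the
256 KB stripe size are only examples; check your controller for the real
stripe size):

    echo 512 > /sys/block/sdb/queue/nr_requests       # raise queue depth from 128
    echo deadline > /sys/block/sdb/queue/scheduler    # switch to the deadline elevator
    echo 256 > /sys/block/sdb/queue/max_sectors_kb    # cap request size at the stripe size

None of these survive a reboot, so once you find something that helps you'd
want to put the echo lines in rc.local or a udev rule.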

Your RAID controller may also have settings, like queue depth, that you can
modify to help get higher throughput / IOPS.
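
For SCSI-attached devices you can at least see (and sometimes change) the
per-device queue depth from the OS side, for example:

    cat /sys/block/sdb/device/queue_depth

Anything beyond that (cache policy, write-back vs. write-through, and so on)
would go through the vendor's CLI tool or the controller BIOS.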

Michael

On Mon, Oct 24, 2011 at 3:32 PM, Jim Kusznir <[email protected]> wrote:

> Ok, so further studies and catching something in-process.  The iostat
> reveals on one server (all three looking fairly similar):
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           0.97    0.00    1.62   11.22    0.00   86.19
>
> Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdb               3.24     0.00 29.93 15.96     0.85     0.04    39.94     0.74   15.98  10.35  47.51
> sdb1              3.24     0.00 29.93 15.96     0.85     0.04    39.94     0.74   15.98  10.35  47.51
> sdc              14.46     0.00 48.63  0.25     2.87     0.00   120.41     2.08   42.41  14.14  69.10
> hda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00 96.51 16.21     3.72     0.04    68.38     3.58   31.77   7.64  86.08
>
> I figure the last line (dm-0) is the most important, as it's the end
> device with the xfs filesystem on it, and it is composed of sdb1 and
> sdc.  What has me concerned is that it is 86% utilized, yet is only
> moving 3.72MB/s out and 0.04MB/s in...Am I reading this correctly?
> Does this point toward the actual problem?
>
> Just a few minutes previous, I ran an I/O test on a compute node with
> dd and 1MB and 512kB block sizes, repeating both cases a few times
> and both in the default block size pvfs2 directory and one locked at
> 1MB block sizes.  All results returned between 50 and 75 MB/s.  That
> tends to show that decent performance is possible, yet I definitely
> don't see the performance in actual runs.
>
> Where do I go from here?
>
> --Jim
>
> On Thu, Oct 13, 2011 at 9:49 AM, Jim Kusznir <[email protected]> wrote:
> > Oh, and while watching iostat on my 3 pvfs servers, I noticed that the
> > "middle" one had notably less I/O reported than the other two (which
> > both peaked in the 80% range during portions of the write; the peak on
> > the middle one was about 60%).
> >
> > --Jim
> >
> > On Thu, Oct 13, 2011 at 9:45 AM, Jim Kusznir <[email protected]> wrote:
> >> The iostat was running while some IO jobs were running from at least
> >> some nodes.  I'm not sure exactly what portion of the job they were in
> >> (I don't run jobs, I run the system....), but I watched it for a
> >> while.  I did eventually see 50% I/O at times.
> >>
> >> I definitely saw a disproportionate amount of I/O to one of the two
> >> devices.  I did not specify any stripe size when I built the LVM
> >> (didn't know anything about it), so that's probably the problem with
> >> the disproportionate I/O.  Is there a way to correct that
> >> non-destructively?
> >>
> >> ping times:
> >> rtt min/avg/max/mdev = 0.114/0.777/3.079/1.014 ms
> >>
> >> Did some testing with the stripe size.  I think I did what was asked:
> >>
> >> [root@compute-0-2 ~]# cd /mnt/pvfs2/kusznir
> >> [root@compute-0-2 kusznir]# setfattr -n user.pvfs2.dist_name -v simple_stripe .
> >> [root@compute-0-2 kusznir]# setfattr -n user.pvfs2.dist_params -v strip_size:1048576 .
> >> [root@compute-0-2 kusznir]# dd if=/dev/zero of=testfile bs=1024k count=1024
> >> 1024+0 records in
> >> 1024+0 records out
> >> 1073741824 bytes (1.1 GB) copied, 11.2155 seconds, 95.7 MB/s
> >> [root@compute-0-2 kusznir]# pvfs2-viewdist -f testfile
> >> dist_name = simple_stripe
> >> dist_params:
> >> strip_size:65536
> >>
> >> Metadataserver: tcp://pvfs2-io-0-1:3334
> >> Number of datafiles/servers = 3
> >> Datafile 0 - tcp://pvfs2-io-0-1:3334, handle: 3571633946 (d4e2cf1a.bstream)
> >> Datafile 1 - tcp://pvfs2-io-0-2:3334, handle: 4288072941 (ff96cced.bstream)
> >> Datafile 2 - tcp://pvfs2-io-0-0:3334, handle: 2856061933 (aa3c0bed.bstream)
> >> [root@compute-0-2 kusznir]#
> >>
> >> --Jim
> >>
> >> On Tue, Oct 11, 2011 at 1:39 PM, Michael Moore <[email protected]> wrote:
> >>> To clarify, these utilization % numbers were during a job, running on some
> >>> number of clients, that was I/O bound? The server side was/is not CPU
> >>> bound, right?
> >>>
> >>> When you LVM'd the two RAIDs together, did you specify the number of
> >>> stripes and the stripe width of the logical volumes? Specifically, did you
> >>> use the --stripes and --stripesize options to lvcreate? Or neither? I would
> >>> expect that you did not, based on the behavior you're seeing.
> >>>
> >>> I know originally you said you were getting 30MBps when doing a 1MB block
> >>> size dd. Could you do that same test now in a directory with the stripe
> >>> size set to 1M, as I mentioned in previous e-mails?
> >>>
> >>> What's the network latency between a compute node and a PVFS server when
> >>> doing a ping? I would expect something in the ballpark of:
> >>> rtt min/avg/max/mdev = 0.126/0.159/0.178/0.019 ms
> >>>
> >>> Michael
> >>>
> >>> On Tue, Oct 11, 2011 at 2:33 PM, Jim Kusznir <[email protected]> wrote:
> >>>>
> >>>> I finally did manage to do this, and the results were a bit
> >>>> interesting.  First, the highest amount I saw in the %utilization
> >>>> column was 16% on one server, and that was only there for 1
> >>>> measurement period.  Typical maximums were 7%.
> >>>>
> >>>> The interesting part was that my second server was rarely over 1%, my
> >>>> first server was 4-7% and my 3rd server was 5-9%.
> >>>>
> >>>> The other interesting part was where the I/O was principally
> >>>> happening.  Originally, I had 8TB of 750GB SATA disks (in a hardware
> >>>> raid-6), and then I added a second RAID-6 of 2TB disks which has the
> >>>> majority of the disk space.  The two are lvm'ed together.  So far,
> >>>> nearly all the %utilization numbers were showing up on the 750GB
> >>>> disks.
> >>>>
> >>>> I have been running xfs_fsr to get the fragmentation down.  My 3rd
> >>>> node is still at 17%; the first node is at 5%, and the 2nd node is at
> >>>> 0.7%.  I've put in a cron job to run xfs_fsr for 4 hours each Sunday
> >>>> night starting at midnight (when my cluster is usually idle anyway) to
> >>>> try and improve/manage that.  I'm not sure if there is actually a
> >>>> causality relationship here, but the load% seems to follow the frag%
> >>>> (higher frag, higher load).
> >>>>
> >>>> Still, the fact that it peaks out so low has me questioning what's going
> >>>> on...
> >>>>
> >>>> Watching it a bit longer into another workload, I do see %use
> >>>> spike up to 35%, but network I/O (as measured by bwm-ng) still peaks
> >>>> at 8MB/s on pure gig-e (which should be capable of 90MB/s).
> >>>>
> >>>> --Jim
> >>>>
> >>>> On Thu, Oct 6, 2011 at 1:36 PM, Emmanuel Florac <[email protected]> wrote:
> >>>> > On Wed, 5 Oct 2011 08:44:11 -0700, you wrote:
> >>>> >
> >>>> >>  I don't
> >>>> >> know how to watch actual IOPS or other more direct metrics.
> >>>> >
> >>>> > Use the iostat command, something like
> >>>> >
> >>>> > iostat -mx 4
> >>>> >
> >>>> > you'll have a very detailed report on disk activity. The percentage of
> >>>> > usage (last column to the right) might be interesting. Let it run for a
> >>>> > while and see if there's a pattern.
> >>>> >
> >>>> > --
> >>>> >
> >>>> > ------------------------------------------------------------------------
> >>>> > Emmanuel Florac     |   Direction technique
> >>>> >                    |   Intellique
> >>>> >                    |   <[email protected]>
> >>>> >                    |   +33 1 78 94 84 02
> >>>> >
> >>>> > ------------------------------------------------------------------------
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
>
