OK, some further study, and I've caught something in-process.  iostat on
one server shows the following (all three look fairly similar):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.97    0.00    1.62   11.22    0.00   86.19

Device:         rrqm/s   wrqm/s    r/s    w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
sda1              0.00     0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
sdb               3.24     0.00  29.93  15.96    0.85    0.04    39.94     0.74   15.98  10.35  47.51
sdb1              3.24     0.00  29.93  15.96    0.85    0.04    39.94     0.74   15.98  10.35  47.51
sdc              14.46     0.00  48.63   0.25    2.87    0.00   120.41     2.08   42.41  14.14  69.10
hda               0.00     0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00  96.51  16.21    3.72    0.04    68.38     3.58   31.77   7.64  86.08

I figure the last line (dm-0) is the most important, as it's the end
device with the xfs filesystem on it, and it is composed of sdb1 and
sdc.  What has me concerned is that it is 86% utilized, yet is only
reading 3.72MB/s and writing 0.04MB/s...Am I reading this correctly?
Does this point toward the actual problem?
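
For anyone who wants to watch the same thing, a command along these
lines should reproduce the view above while limiting the report to just
the interesting devices (assuming your sysstat iostat accepts a device
list before the interval):

iostat -mx sdb1 sdc dm-0 4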

Just a few minutes earlier, I ran an I/O test on a compute node with dd
using 1MB and 512kB block sizes, repeating each case a few times, both
in the pvfs2 directory with the default stripe size and in one locked
to a 1MB stripe size.  All results came back between 50 and 75MB/s.
That suggests decent performance is possible, yet I definitely don't
see that performance in actual runs.
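
For completeness, the dd runs were of roughly this form (treat the test
file name as a placeholder; the directory is the pvfs2 mount used
elsewhere in this thread):

# ~1GB write with 1MB blocks into the pvfs2-mounted directory
dd if=/dev/zero of=/mnt/pvfs2/kusznir/ddtest bs=1M count=1024
# same total size with 512kB blocks
dd if=/dev/zero of=/mnt/pvfs2/kusznir/ddtest bs=512k count=2048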

Where do I go from here?

--Jim

On Thu, Oct 13, 2011 at 9:49 AM, Jim Kusznir <[email protected]> wrote:
> Oh, and while watching iostat on my 3 pvfs servers, I noticed that the
> "middle" one reported notably less I/O than the other two (both of
> which peaked in the 80% range during portions of the write; the highest
> I saw on the middle one was about 60%).
>
> --Jim
>
> On Thu, Oct 13, 2011 at 9:45 AM, Jim Kusznir <[email protected]> wrote:
>> iostat was running while I/O jobs were active on at least some nodes.
>> I'm not sure exactly what portion of the job they were in (I don't run
>> the jobs, I run the system...), but I watched it for a while.  I did
>> eventually see 50% utilization at times.
>>
>> I definitely saw a disproportionate amount of I/O to one of the two
>> devices.  I did not specify any stripe size when I built the LVM
>> (didn't know anything about it), so that's probably the problem with
>> the disproportionate I/O.  Is there a way to correct that
>> non-destructively?
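>>
>> For reference, the current layout can be checked non-destructively
>> with the standard LVM2 tools; something like this should show whether
>> the LV is linear or striped and which PVs each segment sits on:
>>
>> lvs --segments -o +devices
>> lvdisplay -m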
>>
>> ping times:
>> rtt min/avg/max/mdev = 0.114/0.777/3.079/1.014 ms
>>
>> Did some testing with the stripe size.  I think I did what was asked:
>>
>> [root@compute-0-2 ~]# cd /mnt/pvfs2/kusznir
>> [root@compute-0-2 kusznir]# setfattr -n user.pvfs2.dist_name -v simple_stripe .
>> [root@compute-0-2 kusznir]# setfattr -n user.pvfs2.dist_params -v strip_size:1048576 .
>> [root@compute-0-2 kusznir]# dd if=/dev/zero of=testfile bs=1024k count=1024
>> 1024+0 records in
>> 1024+0 records out
>> 1073741824 bytes (1.1 GB) copied, 11.2155 seconds, 95.7 MB/s
>> [root@compute-0-2 kusznir]# pvfs2-viewdist -f testfile
>> dist_name = simple_stripe
>> dist_params:
>> strip_size:65536
>>
>> Metadataserver: tcp://pvfs2-io-0-1:3334
>> Number of datafiles/servers = 3
>> Datafile 0 - tcp://pvfs2-io-0-1:3334, handle: 3571633946 (d4e2cf1a.bstream)
>> Datafile 1 - tcp://pvfs2-io-0-2:3334, handle: 4288072941 (ff96cced.bstream)
>> Datafile 2 - tcp://pvfs2-io-0-0:3334, handle: 2856061933 (aa3c0bed.bstream)
>> [root@compute-0-2 kusznir]#
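>>
>> Assuming the pvfs2 extended attributes can be read back the same way
>> they were set, something like this should confirm whether the directory
>> setting actually took:
>>
>> getfattr -n user.pvfs2.dist_name /mnt/pvfs2/kusznir
>> getfattr -n user.pvfs2.dist_params /mnt/pvfs2/kusznir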
>>
>> --Jim
>>
>> On Tue, Oct 11, 2011 at 1:39 PM, Michael Moore <[email protected]> wrote:
>>> To clarify, these utilization % numbers were taken during a job, running on
>>> some number of clients, that was I/O bound? The server side was/is not CPU
>>> bound, right?
>>>
>>> When you LVM'd the two RAIDs together, did you specify the number of stripes
>>> and the stripe size for the logical volume? Specifically, did you use the
>>> --stripes and --stripesize options to lvcreate? Or neither? Based on the
>>> behavior you're seeing, I would expect that you did not.
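>>>
>>> For reference, a striped LV across the two RAID devices would be created with
>>> something along these lines (the VG name, LV name, and device paths here are
>>> placeholders, not your actual setup; --stripesize is given in KB):
>>>
>>> lvcreate --stripes 2 --stripesize 1024 -l 100%FREE \
>>>   -n pvfs2data vg_pvfs2 /dev/sdb1 /dev/sdc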
>>>
>>> I know you originally said you were getting 30MB/s when doing a dd with a
>>> 1MB block size. Could you do that same test now in a directory with the
>>> stripe size set to 1MB, as I mentioned in previous e-mails?
>>>
>>> What's the network latency between a compute node and a PVFS server when
>>> you ping? I would expect something in the ballpark of:
>>> rtt min/avg/max/mdev = 0.126/0.159/0.178/0.019 ms
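>>>
>>> (A plain "ping -c 20 pvfs2-io-0-1" run from a compute node prints exactly
>>> that rtt summary line; substitute whichever of the pvfs2-io-0-* servers
>>> you like.)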
>>>
>>> Michael
>>>
>>> On Tue, Oct 11, 2011 at 2:33 PM, Jim Kusznir <[email protected]> wrote:
>>>>
>>>> I finally did manage to do this, and the results were a bit
>>>> interesting.  First, the highest amount I saw in the %utilization
>>>> column was 16% on one server, and that was only there for 1
>>>> measurement period.  Typical maximums were 7%.
>>>>
>>>> The interesting part was that my second server was rarely over 1%, my
>>>> first server was 4-7% and my 3rd server was 5-9%.
>>>>
>>>> The other interesting part was where the I/O was principally
>>>> happening.  Originally, I had 8TB built from 750GB SATA disks (in a
>>>> hardware RAID-6), and then I added a second RAID-6 of 2TB disks, which
>>>> holds the majority of the disk space.  The two are LVM'd together.  So
>>>> far, nearly all the %utilization numbers have been showing up on the
>>>> 750GB disks.
>>>>
>>>> I have been running xfs_fsr to get the fragmentation down.  My 3rd
>>>> node is still at 17%; the first node is at 5%, and the 2nd node is at
>>>> 0.7%.  I've put in a cron job to run xfs_fsr for 4 hours each Sunday
>>>> night starting at midnight (when my cluster is usually idle anyway) to
>>>> try to improve/manage that.  I'm not sure there is actually a causal
>>>> relationship here, but the load% seems to follow the frag% (higher
>>>> frag, higher load).
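>>>>
>>>> For what it's worth, the cron entry is along these lines (the 4-hour
>>>> cap uses xfs_fsr's -t option, which takes seconds; the exact binary
>>>> path is from memory and may differ on your distro):
>>>>
>>>> # root crontab: Sunday 00:00, reorganize for at most 4 hours (14400 s)
>>>> 0 0 * * 0 /usr/sbin/xfs_fsr -t 14400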
>>>>
>>>> Still, the fact that it peaks out so low has me questioning what's
>>>> going on...
>>>>
>>>> Watching it a bit longer into another workload, I do see %util spike
>>>> up to 35%, but network I/O (as measured by bwm-ng) still peaks at
>>>> 8MB/s on pure gig-e (which should be capable of ~90MB/s).
>>>>
>>>> --Jim
>>>>
>>>> On Thu, Oct 6, 2011 at 1:36 PM, Emmanuel Florac <[email protected]>
>>>> wrote:
>>>> > On Wed, 5 Oct 2011 08:44:11 -0700, you wrote:
>>>> >
>>>> >>  I don't
>>>> >> know how to watch actual IOPS or other more direct metrics.
>>>> >
>>>> > Use the iostat command, something like
>>>> >
>>>> > iostat -mx 4
>>>> >
>>>> > you'll get a very detailed report on disk activity. The utilization
>>>> > percentage (last column on the right) might be interesting. Let it run
>>>> > for a while and see if there's a pattern.
>>>> >
>>>> > --
>>>> > ------------------------------------------------------------------------
>>>> > Emmanuel Florac     |   Direction technique
>>>> >                    |   Intellique
>>>> >                    |   <[email protected]>
>>>> >                    |   +33 1 78 94 84 02
>>>> > ------------------------------------------------------------------------
>>>> >
>>>>
>>>
>>>
>>
>

