Andrew:

Please try the scenario where you have two servers, each handling both data
and metadata (no dedicated metadata server), and see whether you get better
performance in that two-server configuration.
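
If it helps, that two-server configuration can be generated with something
along these lines (pvfs2-genconfig option names quoted from memory, so please
check pvfs2-genconfig --help first; node1/node2 stand in for your hosts):

  # both hosts act as I/O (data) and metadata servers
  pvfs2-genconfig --protocol tcp \
      --ioservers node1,node2 --metaservers node1,node2 \
      /etc/pvfs2-fs.conf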

Becky Ligon

On Mon, Apr 23, 2012 at 2:11 PM, Andrew Savchenko <[email protected]> wrote:

> Hello,
>
> I stumbled upon severe read performance degradation when the following
> three conditions are met:
> 1) The read block size is larger than the OrangeFS stripe size.
> 2) The read block size is larger than 64 KB (or sometimes 128 KB).
> 3) There is more than a single data server.
>
> Write performance is not affected by this. The read performance
> degradation is huge in magnitude: from 30 times to 100000 times (yes,
> a slowdown of five orders of magnitude: from 100 MB/s to 1 KB/s).
>
> I use orangefs-2.8.5 with a 3.2.14 kernel. Full configure options are
> available in the README file in the attached tarball (in short: aio,
> kernel-threads, bmi_tcp, mmap, and sendfile were enabled by default).
> Some of the build options were varied during testing. I used the kernel
> VFS interface for the tests.
>
> My hardware setup is:
> 16 ProLiant BL2x220c G5 hosts; per-host setup:
> 32 GB RAM, 2 x Xeon E5450 @ 3 GHz.
>
> The network is 1 Gb/s, full duplex, no flow control (this slightly
> improves results), with 9000-byte jumbo frames. The network switch is
> an HP GbE2c Layer 2/3 Ethernet Blade Switch.
>
> Of the 16 nodes, up to 15 were used as servers; the client node was
> always dedicated.
>
> I ran tests from a single host, mostly using iozone with varying record
> sizes combined with varying PVFS2 stripe sizes. Some tests were made
> with other utilities (cp, cat, pvfs2-cp, dd). I tried various network
> sysctl settings with little positive effect.
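>
> The iozone runs were essentially of the following form (a sketch only:
> the exact record and file sizes were looped over, and /mnt/pvfs2 is just
> a placeholder for the actual mount point):
>
>   # write/rewrite (-i 0) and read/reread (-i 1) for one record size
>   iozone -i 0 -i 1 -r 256k -s 10m -f /mnt/pvfs2/iozone.tmp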
>
> All test results are available in the attached tarball; the README file
> there describes the meaning of the directory and file names.
>
> Both pvfs2-server and pvfs2-client were restarted before each set of
> tests, and the kernel module was reloaded. For tests with different
> filesystem options the filesystem was recreated.
>
> With a single data server the results are good and understandable: poor
> performance for small block sizes and good performance for large ones
> (up to 105 MB/s, which is close to the network limit). This holds for
> both cases: 1+1 (one dedicated metadata server and one dedicated data
> server) and 1:1 (a single server acting as both data and metadata
> server).
>
> The problem appears as soon as I add a second data server (e.g. the 1+2
> configuration), with the default 64 KB stripe size:
>  size(KB)  reclen(KB)   write rewrite    read    reread   (rates in KB/s)
>  1024       4    6235    6334     6358     6350
>  1024       8   10424   10583    10456    10395
>  1024      16   14861   15085    18639    18443
>  1024      32   26593   27130    30873    31693
>  1024      64   39460   40759    44671    46421
>  1024     128   60462   60880    66502    67230
>  1024     256   78221   79073     2031     4037
>  1024     512   91542   91526    93568    94542
>  1024    1024  101808  102329   100965    97235
> 10240       4    6133    6193     6166     6106
> 10240       8   10753   10696    10663    10650
> 10240      16   15011   15088    18496    18485
> 10240      32   26521   26871    31035    31070
> 10240      64   40843   41602    47019    48037
> 10240     128   59882   60358    66284    66707
> 10240     256   78225   78644     6635     5117
> 10240     512   91103   91934    25870    20008
> 10240    1024  101995  101857    27425    11925
> 10240    2048  108281  108871    17069    14235
> (the full table can be found in the tarball)
>
> As you can see, read performance drops drastically above a 128 KB block
> size. With a larger number of servers the overall picture is the same,
> but the performance degradation is even worse.
>
> This can't be a disk performance issue, because with a single data
> server things are fine. It is unlikely to be an incast problem, because
> just two servers are sufficient to reproduce the issue. I also
> monitored network and disk activity and CPU utilization and found no
> bottlenecks so far.
>
> It looks like some kind of buffer throttling to me, but I have failed
> to find that buffer.
>
> First I tried increasing the network buffers: rmem_max, wmem_max,
> tcp_rmem, and tcp_wmem, up to 16 MB in several steps. Only a slight
> degradation was observed. Test results are in the *-mem2? directories.
> I also tried increasing the 2nd (default value) field of
> tcp_rmem/tcp_wmem to 1 MB, with no effect.
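>
> For concreteness, the largest step looked roughly like this (the exact
> values for each step are in the tarball):
>
>   sysctl -w net.core.rmem_max=16777216
>   sysctl -w net.core.wmem_max=16777216
>   sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
>   sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"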
>
> Then I tried disabling the experimental options enabled in my original
> setup: threaded-kmod-helper, kernel-sendfile, mmap-racache. I found
> that sendfile support degrades performance for small access patterns
> and somewhat improves it for large ones. After this, and after reading
> the sendfile-related threads in the pvfs2 mailing lists, I decided to
> disable this option for better overall throughput of the setup. The
> other options had no considerable effect in my tests. Results are in
> the -no(*) directories.
>
> Afterwards I tried to increase the pvfs2 server buffers manually using
> the TCPBufferSend/TCPBufferReceive options, up to 2 MB. This had no
> effect. Results are in the -pvfsbuf-* directories.
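>
> That is, roughly the following in the server config (2 MB shown; the
> section placement is quoted from memory, so treat this as a sketch):
>
>   TCPBufferSend 2097152
>   TCPBufferReceive 2097152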
>
> As you can see in my benchmark script, I set stripe sizes using:
> setfattr -n user.pvfs2.dist_params -v strip_size:$((s*1024)) $name
> Changes were confirmed using pvfs2-viewdist.
> To check that the default filesystem distribution has no effect, I
> changed the default FS stripe size to 256 KB. It had no effect.
> Results are in -deffsstrip-256k.
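>
> For reference, the 256 KB default was set via a distribution block in
> the filesystem config along these lines (syntax quoted from memory, so
> please double-check it against your fs.conf):
>
>   <Distribution>
>       Name simple_stripe
>       Param strip_size
>       Value 262144
>   </Distribution>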
>
> I found an interesting discussion on pvfs2-users about a similar
> issue:
>
> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-July/002857.html
> I tried the sysctl parameters recommended here:
>
> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-July/002865.html
> with disastrous consequences. You may see the results in the
> -recommended-sysctl directory. The test was terminated because it took
> too long. Read speed after the "buffering" limit dropped to 1 KB/s
> (yes, one kilobyte per second). I was shocked; I checked this with dd
> and the result was the same.
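>
> The dd check was along these lines (the path is just a placeholder for
> the test file used):
>
>   dd if=/mnt/pvfs2/testfile of=/dev/null bs=1M count=100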
>
> Then I tried to find out exactly which option caused this unimaginable
> bottleneck. It turned out to be the disabled TCP timestamps:
> net.ipv4.tcp_timestamps = 0
> (See -recommended-sysctl[23] for the bisection results.)
> The other options had no measurable impact.
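>
> In other words, toggling just this one knob reproduces the collapse:
>
>   sysctl -w net.ipv4.tcp_timestamps=0   # reads drop to ~1 KB/s
>   sysctl -w net.ipv4.tcp_timestamps=1   # restores the previous behaviour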
>
> I don't know why disabling timestamps caused such a disaster; maybe
> this is a clue to the original issue, but I don't understand it. To my
> knowledge TCP timestamps should have no effect on such small data
> samples at all.
>
> Afterwards I tried to find the best stripe size for "normal" cp/cat
> operations under the best conditions I managed to find: 15:15-no
> (sendfile), i.e. 15 servers serving both data and metadata with
> sendfile support disabled. Test results are in the -utils directory. I
> found that a 1 MB stripe size gives the best performance, and further
> increasing the stripe size gives no significant benefit.
>
> In conclusion:
>
> I still can't find the reason for the original bottleneck. I suspect
> this is some kind of insufficient-buffer condition, but I can't find
> that buffer. Any help would be highly welcome.
>
> Best regards,
> Andrew Savchenko
>


-- 
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
