Hi Andrew -

Adding on to what Becky mentioned: can you run some NetPIPE TCP tests
to see how your network is performing, just to double/triple-check?
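
For reference, the way I usually run it is roughly the following; the
host name and output file are just placeholders:

    # on one node, start the NetPIPE TCP receiver:
    NPtcp
    # on the other node, run the transmitter against it:
    NPtcp -h node01 -o np.out

and then compare the reported bandwidth curve against what you see
from OrangeFS.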

I typically set FlowBufferSize to something higher, like 1 MB (or
better yet, to a value that aligns with NetPIPE's results for higher
bandwidth). Have you tried this? It probably won't fix the problem,
but it may give modestly improved results.
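
Something along these lines in the server config (I am going from
memory on the exact spelling, so please check it against the options
your pvfs2-server build actually accepts; in my configs it lives in
the <Defaults> section as FlowBufferSizeBytes):

    <Defaults>
        ...
        FlowBufferSizeBytes 1048576
    </Defaults>

followed by a restart of the servers.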

Kyle Schochenmaier



On Mon, Apr 23, 2012 at 1:40 PM, Becky Ligon <[email protected]> wrote:
> Andrew:
>
> Please try the scenario where you have 2 servers, each handling both
> data and metadata (no dedicated metadata server), and see whether you
> get better performance in that configuration.
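>
> For example, a config generated along these lines (host names are
> placeholders and the option spellings are from memory; pvfs2-genconfig
> will also just prompt interactively if you only give it the output
> file name):
>
>   pvfs2-genconfig --protocol tcp --tcpport 3334 \
>       --ioservers node01,node02 --metaservers node01,node02 \
>       /etc/pvfs2-fs.conf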
>
> Becky Ligon
>
> On Mon, Apr 23, 2012 at 2:11 PM, Andrew Savchenko <[email protected]> wrote:
>>
>> Hello,
>>
>> I stumbled upon severe read performance degradation when the
>> following three conditions are met:
>> 1) The read block size is larger than the OrangeFS stripe size.
>> 2) The read block size is larger than 64 KB (or sometimes 128 KB).
>> 3) There is more than a single data server.
>>
>> Write performance is not affected by this. The read performance
>> degradation is huge: from 30 times to 100000 times (yes, a slowdown
>> of 5 orders of magnitude: from 100 MB/s to 1 KB/s).
>>
>> I use orangefs-2.8.5 with a 3.2.14 kernel. The full configure options
>> are available in the README file in the attached tarball (in short:
>> aio, kernel-threads, bmi_tcp, mmap and sendfile were enabled by
>> default). Some of the build options were varied during testing. I
>> used the kernel VFS interface for the tests.
>>
>> My hardware setup is:
>> 16 ProLiant BL2x220c G5 hosts; per host:
>> 32 GB RAM, 2 x Xeon E5450 @ 3 GHz.
>>
>> The network is 1 Gb/s, full duplex, no flow control (this slightly
>> improves results), with 9000-byte jumbo frames. The network switch is
>> an HP GbE2c Layer 2/3 Ethernet Blade Switch.
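>>
>> (The MTU and flow-control settings can be checked and changed with
>> the usual tools, e.g.:
>>
>>   ip link show eth0            # MTU 9000
>>   ethtool -a eth0              # pause (flow control) settings
>>   ethtool -A eth0 rx off tx off
>>
>> where the interface name is just an example.)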
>>
>> Of the 16 nodes, up to 15 were used as servers; the client node was
>> always dedicated.
>>
>> I ran tests from a single host, mostly using iozone with varying
>> record sizes together with varying PVFS2 stripe sizes. Some tests
>> were made with other utilities (cp, cat, pvfs2-cp, dd). I tried
>> various network sysctl settings, with little positive effect.
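>>
>> The benchmark loop is essentially of this shape (the sizes and paths
>> here are only an illustration; the real script is in the tarball):
>>
>>   for s in 64 256 1024; do                     # stripe size in KB
>>       name=/mnt/pvfs2/test-$s; mkdir -p $name
>>       setfattr -n user.pvfs2.dist_params -v strip_size:$((s*1024)) $name
>>       for r in 4 16 64 256 1024; do            # record size in KB
>>           iozone -i 0 -i 1 -r ${r}k -s 10240k -f $name/file \
>>               > iozone-${s}k-${r}k.log
>>       done
>>   done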
>>
>> All test results are available in the attached tarball; the README
>> file there describes the meaning of the directory and file names.
>>
>> Both pvfs2-server and pvfs2-client were restarted before each set
>> of tests, and the kernel module was reloaded. For tests with
>> different filesystem options the filesystem was recreated.
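>>
>> For completeness, restarting everything means something along these
>> lines (paths and host names are illustrative):
>>
>>   umount /mnt/pvfs2
>>   killall pvfs2-client pvfs2-client-core
>>   rmmod pvfs2
>>   # on every server node:
>>   killall pvfs2-server
>>   pvfs2-server /etc/pvfs2-fs.conf
>>   # back on the client:
>>   modprobe pvfs2
>>   pvfs2-client -p /usr/sbin/pvfs2-client-core
>>   mount -t pvfs2 tcp://node01:3334/pvfs2-fs /mnt/pvfs2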
>>
>> With a single data server the results are good and understandable:
>> poor performance for small block sizes and good performance for large
>> ones (up to 105 MB/s, which is close to the network limit). This
>> holds for both cases: 1+1 (one dedicated metadata server and one
>> dedicated data server) and 1:1 (a single server acting as both data
>> and metadata server).
>>
>> The problem appears as soon as I add just one more data server (e.g.
>> a 1+2 configuration) with the default 64 KB stripe size; iozone
>> values below are in KB/s:
>>  KB  reclen   write rewrite    read    reread
>>  1024       4    6235    6334     6358     6350
>>  1024       8   10424   10583    10456    10395
>>  1024      16   14861   15085    18639    18443
>>  1024      32   26593   27130    30873    31693
>>  1024      64   39460   40759    44671    46421
>>  1024     128   60462   60880    66502    67230
>>  1024     256   78221   79073     2031     4037
>>  1024     512   91542   91526    93568    94542
>>  1024    1024  101808  102329   100965    97235
>> 10240       4    6133    6193     6166     6106
>> 10240       8   10753   10696    10663    10650
>> 10240      16   15011   15088    18496    18485
>> 10240      32   26521   26871    31035    31070
>> 10240      64   40843   41602    47019    48037
>> 10240     128   59882   60358    66284    66707
>> 10240     256   78225   78644     6635     5117
>> 10240     512   91103   91934    25870    20008
>> 10240    1024  101995  101857    27425    11925
>> 10240    2048  108281  108871    17069    14235
>> (you may find the full table in the tarball)
>>
>> As you can see, read performance drops drastically above a 128 KB
>> block size. With a larger number of servers the overall picture is
>> the same, but the degradation is even worse.
>>
>> This can't be a disk performance issue, because with a single data
>> server things are fine. It is unlikely to be an incast problem,
>> because just two servers are sufficient to reproduce the issue. I
>> also monitored network and disk activity and CPU utilization, and
>> found no bottlenecks so far.
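>>
>> (By monitoring I mean watching the standard counters, e.g.:
>>
>>   iostat -x 1      # per-disk utilization
>>   sar -n DEV 1     # per-interface network throughput
>>   top              # CPU load
>>
>> and nothing there pointed at a saturated resource during the slow
>> reads.)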
>>
>> It looks like some buffer throttling to me, but I failed to find that
>> buffer.
>>
>> First I tried to increase the network buffers: rmem_max, wmem_max,
>> tcp_rmem and tcp_wmem, up to 16 MB in several steps. Only a slight
>> degradation was observed. Test results are in the *-mem2? directories.
>> I also tried to increase the 2nd (default value) field of
>> tcp_rmem/tcp_wmem to 1 MB, with no effect.
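>>
>> For reference, the largest step was along these lines (the values
>> shown are the 16 MB / 1 MB case; the other steps are analogous):
>>
>>   sysctl -w net.core.rmem_max=16777216
>>   sysctl -w net.core.wmem_max=16777216
>>   sysctl -w net.ipv4.tcp_rmem="4096 1048576 16777216"
>>   sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"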
>>
>> Then I tried to disable the experimental options enabled in my
>> original setup: threaded-kmod-helper, kernel-sendfile, mmap-racache.
>> I found that sendfile support degrades performance for small access
>> patterns and somewhat improves it for large ones. After this, and
>> after reading sendfile-related threads in the pvfs2 mailing lists, I
>> decided to disable this option for better overall throughput of the
>> setup. The other options had no considerable effect in my tests.
>> Results are in the -no(*) directories.
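>>
>> (The exact configure lines are in the README in the tarball; roughly,
>> assuming the usual PVFS2 spelling of the flags, toggling these
>> options looks like:
>>
>>   ./configure --with-kernel=/usr/src/linux \
>>       --enable-threaded-kmod-helper \
>>       --enable-kernel-sendfile \
>>       --enable-mmap-racache
>>
>> with the -no(*) builds made without the corresponding --enable-*
>> switch.)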
>>
>> Afterwards I tried to increase the pvfs2 server buffers manually
>> using the TCPBufferSend/TCPBufferReceive options, up to 2 MB, with no
>> effect. Results are in the -pvfsbuf-* directories.
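>>
>> (For reference, that means entries of this form in the server config
>> file, typically in the <Defaults> section:
>>
>>   TCPBufferSend 2097152
>>   TCPBufferReceive 2097152
>>
>> with the servers restarted after each change.)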
>>
>> As you can see in my benchmark script, I set stripe sizes using:
>> setfattr -n user.pvfs2.dist_params -v strip_size:$((s*1024)) $name
>> The changes were confirmed using pvfs2-viewdist.
>> To check that the default filesystem distribution plays no role, I
>> changed the default FS stripe size to 256 KB; it had no effect.
>> Results are in -deffsstrip-256k.
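>>
>> (Changing the filesystem default means a <Distribution> block in the
>> server config, along these lines, inside the <FileSystem> section:
>>
>>   <Distribution>
>>       Name simple_stripe
>>       Param strip_size
>>       Value 262144
>>   </Distribution>
>>
>> followed by recreating the filesystem, as mentioned above.)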
>>
>> I found an interesting discussion on pvfs2-users about a similar
>> issue:
>>
>> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-July/002857.html
>>
>> I tried the sysctl parameters recommended here:
>>
>> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-July/002865.html
>>
>> with disastrous consequences. You may see the results in the
>> -recommended-sysctl directory. The test was terminated because it
>> took too long: after the "buffering" limit the read speed dropped to
>> 1 KB/s (yes, one kilobyte per second). I was shocked; I checked this
>> with dd and the result was the same.
>>
>> Then I tried to find out which exact option caused this unimaginable
>> bottleneck. It turned out to be the disabled TCP timestamps:
>> net.ipv4.tcp_timestamps = 0
>> (See -recommended-sysctl[23] for the bisection results.)
>> The other options had no measurable impact.
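>>
>> The effect is easy to reproduce by flipping just that sysctl and
>> re-reading a file, for example:
>>
>>   sysctl -w net.ipv4.tcp_timestamps=0
>>   dd if=/mnt/pvfs2/testfile of=/dev/null bs=1M
>>   sysctl -w net.ipv4.tcp_timestamps=1
>>   dd if=/mnt/pvfs2/testfile of=/dev/null bs=1M
>>
>> The first read crawls at about 1 KB/s, the second runs at normal
>> speed.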
>>
>> I don't know why disabled timestamps cause such a disaster; maybe
>> this is a clue to the original issue, but I don't understand it. To
>> my knowledge, TCP timestamps should have no effect on such small data
>> samples at all.
>>
>> Afterwards I tried to find the best stripe size for "normal" cp/cat
>> operations under the best conditions I managed to find: 15:15-no
>> (sendfile), i.e. 15 servers serving both data and metadata with
>> sendfile support disabled. Test results are in the -utils directory.
>> I found that a 1 MB stripe size gives the best performance and that
>> growing the stripe size further gives no significant benefit.
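>>
>> (In practice that just means setting the hint on the target
>> directory, e.g.
>>
>>   mkdir /mnt/pvfs2/data
>>   setfattr -n user.pvfs2.dist_params -v strip_size:1048576 /mnt/pvfs2/data
>>
>> so that files created inside it use the 1 MB stripe, and then timing
>> plain cp/cat into and out of that directory.)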
>>
>> In conclusion:
>>
>> I still can't find the reason for the original bottleneck. I suspect
>> it is some kind of insufficient-buffer condition, but I can't find
>> that buffer. Any help will be highly appreciated.
>>
>> Best regards,
>> Andrew Savchenko
>>
>
>
>
> --
> Becky Ligon
> OrangeFS Support and Development
> Omnibond Systems
> Anderson, South Carolina
>
>
>

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
