Andrew: Please try the scenario where you have 2 servers, each handling both data and metadata (no dedicated metadata server), and see whether you get better read performance in that configuration.
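
Concretely, "no dedicated metadata server" just means that both servers are listed in both handle-range sections of the filesystem config. A rough sketch of what pvfs2-genconfig should produce for that layout (the aliases, ports, filesystem ID and handle ranges below are only illustrative; the generated file will have its own values):

    <Aliases>
        Alias node1 tcp://node1:3334
        Alias node2 tcp://node2:3334
    </Aliases>

    <Filesystem>
        Name pvfs2-fs
        ID 499704026
        # both servers hold metadata ...
        <MetaHandleRanges>
            Range node1 4-2147483650
            Range node2 2147483651-4294967297
        </MetaHandleRanges>
        # ... and both servers hold file data
        <DataHandleRanges>
            Range node1 4294967298-6442450944
            Range node2 6442450945-8589934591
        </DataHandleRanges>
    </Filesystem>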

Becky Ligon

On Mon, Apr 23, 2012 at 2:11 PM, Andrew Savchenko <[email protected]> wrote:
> Hello,
>
> I stumbled upon severe read performance degradation when the following
> three conditions are met:
> 1) Read block size is larger than the orangefs stripe size.
> 2) Read block size is larger than 64 KB (or sometimes 128 KB).
> 3) There is more than a single data server.
>
> Write performance is not affected by this. The read performance
> degradation is huge in magnitude: from 30 times to 100000 times (yes,
> a slowdown of five orders of magnitude: from 100 MB/s to 1 KB/s).
>
> I use orangefs-2.8.5 with a 3.2.14 kernel. Full configure options are
> available in the README file in the attached tarball (in short: aio,
> kernel-threads, bmi_tcp, mmap and sendfile were enabled by default).
> Some of the build options were varied during testing. I used the kernel
> VFS interface for the tests.
>
> My hardware setup is:
> 16 ProLiant BL2x220c G5 hosts; per-host setup:
> 32 GB RAM, 2 x Xeon E5450 @3GHz.
>
> Network is 1 Gb/s, full duplex, no flow control (this slightly
> improves results), 9000-byte jumbo frames. The network switch is an
> HP GbE2c Layer 2/3 Ethernet Blade Switch.
>
> Of the 16 nodes up to 15 were used as servers; the client node was
> always dedicated.
>
> I ran tests from a single host, mostly using iozone with varying
> record sizes together with varying PVFS2 stripe sizes. Some tests
> were made using other utilities (cp, cat, pvfs2-cp, dd). I tried
> various network sysctl settings with only slight positive effect.
>
> All test results are available in the attached tarball; the README file
> there describes the meaning of the directory and file names.
>
> Both pvfs2-server and pvfs2-client were restarted before each set
> of tests and the kernel module was reloaded. For tests with different
> filesystem options the filesystem was recreated.
>
> With a single data server the results are good and understandable: poor
> performance for small block sizes and good for large ones (up to 105
> MB/s, which is close to the network limit). This is valid for both cases:
> 1+1 (one dedicated metadata server and one dedicated data server) and
> 1:1 (a single server serving as both data and metadata server).
>
> The problem appears as soon as I add just a single data server (e.g. a
> 1+2 configuration) with the default 64 KB stripe size (iozone output;
> file size and record size in KB, throughput in KB/s):
>
>         KB  reclen    write  rewrite     read   reread
>       1024       4     6235     6334     6358     6350
>       1024       8    10424    10583    10456    10395
>       1024      16    14861    15085    18639    18443
>       1024      32    26593    27130    30873    31693
>       1024      64    39460    40759    44671    46421
>       1024     128    60462    60880    66502    67230
>       1024     256    78221    79073     2031     4037
>       1024     512    91542    91526    93568    94542
>       1024    1024   101808   102329   100965    97235
>      10240       4     6133     6193     6166     6106
>      10240       8    10753    10696    10663    10650
>      10240      16    15011    15088    18496    18485
>      10240      32    26521    26871    31035    31070
>      10240      64    40843    41602    47019    48037
>      10240     128    59882    60358    66284    66707
>      10240     256    78225    78644     6635     5117
>      10240     512    91103    91934    25870    20008
>      10240    1024   101995   101857    27425    11925
>      10240    2048   108281   108871    17069    14235
>
> (You may find the full table in the tarball.)
>
> As you can see, read performance drops drastically beyond a 128 KB block
> size. With a larger number of servers the overall picture is the same,
> but the performance degradation is even worse.
>
> This can't be a disk performance issue, because with a single data
> server things are fine. It is unlikely to be an incast problem, because
> just two servers are sufficient to reproduce the issue. I also monitored
> network and disk activity and CPU utilization, and found no bottlenecks
> so far.
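>
> For reference, the iozone runs behind the table above were of roughly
> this form (the exact size bounds varied from run to run and the mount
> point is only illustrative):
>
>     iozone -a -i 0 -i 1 -n 1m -g 10m -y 4k -q 2m \
>            -f /mnt/pvfs2/iozone.tmp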
>
> It looks like some buffer throttling to me, but I have failed to find
> that buffer.
>
> First I tried to increase the network buffers: rmem_max, wmem_max,
> tcp_rmem, tcp_wmem up to 16 MB in several steps. Only a slight
> degradation was observed. Test results are in the *-mem2? directories.
> I also tried to increase the 2nd (default value) field of
> tcp_rmem/tcp_wmem to 1 MB with no effect.
>
> Then I tried to disable the experimental options enabled in my original
> setup: threaded-kmod-helper, kernel-sendfile, mmap-racache. I found
> that sendfile support degrades performance for small access patterns
> and somewhat improves it for large ones. After this, and after reading
> the sendfile-related threads on the pvfs2 mailing lists, I decided to
> disable this option for better overall throughput of the setup. The
> other options had no considerable effect on my tests. Results are in
> the -no(*) directories.
>
> Afterwards I tried to increase the pvfs2 server buffers manually using
> the TCPBufferSend/TCPBufferReceive options, up to 2 MB. I got no effect.
> Results are in the -pvfsbuf-* directories.
>
> As you can see in my benchmark script, I set stripe sizes using:
>     setfattr -n user.pvfs2.dist_params -v strip_size:$((s*1024)) $name
> The changes were confirmed using pvfs2-viewdist.
> To check that the default filesystem distribution has no effect, I
> changed the default FS stripe size to 256 KB. It had no effect. Results
> are in -deffsstrip-256k.
>
> I found an interesting discussion on pvfs2-users about a similar issue:
> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-July/002857.html
> I tried the sysctl parameters recommended here:
> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-July/002865.html
> with disastrous consequences. You may see the results in the
> -recommended-sysctl directory. The test was terminated because it took
> too long. Read speed beyond the "buffering" limit dropped to 1 KB/s
> (yes, one kilobyte per second). I was shocked; I checked this with dd
> and the result was the same.
>
> Then I tried to find out which exact option caused this unimaginable
> bottleneck. It turned out to be disabled TCP timestamps:
>     net.ipv4.tcp_timestamps = 0
> (See -recommended-sysctl[23] for the bisection results.)
> The other options had no measurable impact.
>
> I don't know why disabled timestamps caused such a disaster; maybe this
> is a clue to the original issue, but I don't understand it. To my
> knowledge TCP timestamps should have no effect on such small data
> samples at all.
>
> Afterwards I tried to find the best stripe size for "normal" cp/cat
> operations under the best conditions I managed to find: 15:15-no
> (sendfile), i.e. 15 servers serving both data and metadata with sendfile
> support disabled. Test results are in the -utils directory. I found that
> a 1 MB stripe size gives the best performance and that further
> increasing the stripe size gives no significant benefit.
>
> As a conclusion:
>
> I still can't find the reason for the original bottleneck. I suspect
> this is some kind of insufficient-buffer condition, but I can't find
> that buffer. Any help will be highly welcomed.
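>
> For completeness, the buffer-related settings mentioned above took
> roughly the following form at their largest values (the min/default
> fields of tcp_rmem/tcp_wmem are the stock ones, only the maxima were
> raised; the TCPBuffer* lines go into the server configuration file):
>
>     sysctl -w net.core.rmem_max=16777216 net.core.wmem_max=16777216
>     sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
>     sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"
>     # re-enabled after the -recommended-sysctl test; with 0, reads drop to ~1 KB/s
>     sysctl -w net.ipv4.tcp_timestamps=1
>
>     # in the pvfs2 server config:
>     TCPBufferSend    2097152
>     TCPBufferReceive 2097152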
>
> Best regards,
> Andrew Savchenko

--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
