Hi Andrew,

Adding on to what Becky mentioned: can you run some netpipe TCP tests to see how your network is performing, just to double- or triple-check?
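Something along these lines should do it, assuming NetPIPE is built on both nodes (the TCP binary is normally called NPtcp; the hostname below is a placeholder):

    # on one of the OrangeFS server nodes: start the NetPIPE receiver
    NPtcp

    # on the client node: run the transmitter against that server,
    # replacing server01 with the receiver's actual hostname
    NPtcp -h server01

Compare the peak bandwidth it reports against what you see through OrangeFS; if the raw TCP numbers are already low, the problem is below the filesystem.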
I typically set the FlowBufferSize to something higher, like 1 MB (or better yet, to something that aligns with netpipe's results for higher bandwidth). Have you tried this? It probably won't fix the problem, but it may give modestly improved results.

Kyle Schochenmaier

On Mon, Apr 23, 2012 at 1:40 PM, Becky Ligon <[email protected]> wrote:
> Andrew:
>
> Please try the scenario where you have two servers, each handling both data and metadata, with no dedicated metadata server, and see if you get better performance in that two-server scenario.
>
> Becky Ligon
>
> On Mon, Apr 23, 2012 at 2:11 PM, Andrew Savchenko <[email protected]> wrote:
>>
>> Hello,
>>
>> I stumbled upon severe read performance degradation when the following three conditions are met:
>> 1) The read block size is larger than the OrangeFS stripe size.
>> 2) The read block size is larger than 64 KB (or sometimes 128 KB).
>> 3) There is more than a single data server.
>>
>> Write performance is not affected. The read performance degradation is huge in magnitude: from 30 times to 100,000 times (yes, a slowdown of five orders of magnitude: from 100 MB/s to 1 KB/s).
>>
>> I use orangefs-2.8.5 with a 3.2.14 kernel. The full configure options are available in the README file in the attached tarball (in short: aio, kernel-threads, bmi_tcp, mmap and sendfile were enabled by default). Some of the build options were varied during testing. I used the kernel VFS interface for the tests.
>>
>> My hardware setup is 16 ProLiant BL2x220c G5 hosts; each host has 32 GB RAM and 2 x Xeon E5450 @ 3 GHz.
>>
>> The network is 1 Gb/s, full duplex, no flow control (this slightly improves results), with 9000-byte jumbo frames. The switch is an HP GbE2c Layer 2/3 Ethernet Blade Switch.
>>
>> Of the 16 nodes, up to 15 were used as servers; the client node was always dedicated.
>>
>> I ran tests from a single host, mostly using iozone with varying record sizes together with varying PVFS2 stripe sizes. Some tests were made using other utilities (cp, cat, pvfs2-cp, dd). I tried various network sysctl settings with little positive effect.
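>> For reference, a typical invocation in this series looked roughly like the following (the mount point and file name are placeholders, not the exact paths from my script):
>>
>>     # write/rewrite (-i 0) and read/reread (-i 1) of a 10 MB file
>>     # with a 256 KB record size on the PVFS2 mount
>>     iozone -i 0 -i 1 -s 10240k -r 256k -f /mnt/pvfs2/iozone.tmp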
>> All test results are available in the attached tarball; the README file there describes the meaning of the directory and file names.
>>
>> Both pvfs2-server and pvfs2-client were restarted before each set of tests, and the kernel module was reloaded. For tests with different filesystem options the filesystem was recreated.
>>
>> With a single data server the results are good and understandable: poor performance for small block sizes and good performance for large ones (up to 105 MB/s, which is close to the network limit). This holds for both cases: 1+1 (one dedicated metadata server and one dedicated data server) and 1:1 (a single server acting as both data and metadata server).
>>
>> The problem appears as soon as I add just one more data server (e.g. a 1+2 configuration), with the default 64 KB stripe size (file size and reclen in KB, throughput in KB/s):
>>
>>      KB  reclen    write  rewrite     read   reread
>>    1024       4     6235     6334     6358     6350
>>    1024       8    10424    10583    10456    10395
>>    1024      16    14861    15085    18639    18443
>>    1024      32    26593    27130    30873    31693
>>    1024      64    39460    40759    44671    46421
>>    1024     128    60462    60880    66502    67230
>>    1024     256    78221    79073     2031     4037
>>    1024     512    91542    91526    93568    94542
>>    1024    1024   101808   102329   100965    97235
>>   10240       4     6133     6193     6166     6106
>>   10240       8    10753    10696    10663    10650
>>   10240      16    15011    15088    18496    18485
>>   10240      32    26521    26871    31035    31070
>>   10240      64    40843    41602    47019    48037
>>   10240     128    59882    60358    66284    66707
>>   10240     256    78225    78644     6635     5117
>>   10240     512    91103    91934    25870    20008
>>   10240    1024   101995   101857    27425    11925
>>   10240    2048   108281   108871    17069    14235
>>
>> (The full table can be found in the tarball.)
>>
>> As you can see, read performance drops drastically beyond a 128 KB block size. With a larger number of servers the overall picture is the same, but the degradation is even worse.
>>
>> This can't be a disk performance issue, because with a single data server things are fine. It is unlikely to be an incast problem, because just two servers are sufficient to reproduce the issue. I also monitored network and disk activity and CPU utilization, and found no bottlenecks so far.
>>
>> It looks like some kind of buffer throttling to me, but I have failed to find that buffer.
>>
>> First I tried to increase the network buffers: rmem_max, wmem_max, tcp_rmem and tcp_wmem, up to 16 MB in several steps. Only a slight degradation was observed. The test results are in the *-mem2? directories. I also tried increasing the 2nd (default) field of tcp_rmem/tcp_wmem to 1 MB, with no effect.
>>
>> Then I tried disabling the experimental options enabled in my original setup: threaded-kmod-helper, kernel-sendfile and mmap-racache. I found that sendfile support degrades performance for small access patterns and somewhat improves it for large ones. After this, and after reading the sendfile-related threads on the pvfs2 mailing lists, I decided to disable this option for better overall throughput of the setup. The other options had no considerable effect on my tests. Results are in the -no(*) directories.
>>
>> Afterwards I tried to increase the pvfs2 server buffers manually using the TCPBufferSend/TCPBufferReceive options, up to 2 MB. This had no effect. Results are in the -pvfsbuf-* directories.
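>> For reference, that change amounted to something like the following in the server's fs.conf (values illustrative; the rest of the Defaults section is omitted):
>>
>>     <Defaults>
>>         # TCP socket buffer sizes, in bytes, for the server's BMI TCP module
>>         TCPBufferSend 2097152
>>         TCPBufferReceive 2097152
>>     </Defaults>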
>> As you can see in my benchmark script, I set the stripe sizes using:
>>     setfattr -n user.pvfs2.dist_params -v strip_size:$((s*1024)) $name
>> The changes were confirmed with pvfs2-viewdist. To check that the default filesystem distribution has no effect, I changed the default FS stripe size to 256 KB; it made no difference. Results are in -deffsstrip-256k.
>>
>> I found an interesting discussion on pvfs2-users about a similar issue:
>> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-July/002857.html
>> I tried the sysctl parameters recommended here:
>> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-July/002865.html
>> with disastrous consequences. You can see the results in the -recommended-sysctl directory. The test was terminated because it took too long: the read speed beyond the "buffering" limit dropped to 1 KB/s (yes, one kilobyte per second). I was shocked; I rechecked with dd and the result was the same.
>>
>> Then I tried to find out exactly which option caused this unimaginable bottleneck. It turned out to be disabled TCP timestamps:
>>     net.ipv4.tcp_timestamps = 0
>> (See -recommended-sysctl[23] for the bisection results.) The other options had no measurable impact.
>>
>> I don't know why disabling timestamps causes such a disaster. Maybe this is a clue to the original issue, but I don't understand it: to my knowledge TCP timestamps should have no effect at all on such small data samples.
>>
>> Afterwards I tried to find the best stripe size for "normal" cp/cat operations under the best conditions I had found: 15:15-no(sendfile), i.e. 15 servers serving both data and metadata with sendfile support disabled. Test results are in the -utils directory. I found that a 1 MB stripe size gives the best performance, and increasing the stripe size further gives no significant benefit.
>>
>> As a conclusion: I still can't find the cause of the original bottleneck. I suspect it is some kind of insufficient-buffer condition, but I can't find that buffer. Any help will be highly welcome.
>>
>> Best regards,
>> Andrew Savchenko
>
> --
> Becky Ligon
> OrangeFS Support and Development
> Omnibond Systems
> Anderson, South Carolina
