How many nodes make up that volume you were using for testing? Over 100 nodes running QDR/IPoIB with 100 threads, we ran around 60GB/s for reads and somewhere around 40GB/s for writes (IIRC).
On Jul 10, 2013, at 1:49 PM, Matthew Nicholson <[email protected]> wrote:

> Well, first of all, thanks for the responses. The volume WAS failing over to
> tcp just as predicted, though WHY is unclear, as the fabric is known working
> (it has about 28K compute cores on it, all doing heavy MPI testing), and the
> OFED/verbs stack is consistent across all client/storage systems (in fact,
> the OS image is identical).
>
> That's quite sad that RDMA isn't going to make 3.4. We put a good deal of
> hope and effort into planning around 3.4 for this storage system,
> specifically for RDMA support (with warnings to the team that it wasn't
> in/tested for 3.3, and that all we could do was HOPE it would be in 3.4 in
> time for when we want to go live). We're getting "okay" performance out of
> IPoIB right now, and our bottleneck actually seems to be the fabric
> design/layout, as we're peaking at about 4.2GB/s writing 10TB over 160
> threads to this distributed volume.
>
> When it IS ready and in 3.4.1 (hopefully!), having good docs around it, and
> maybe even a simple printf for the tcp failover, would be huge for us.
>
> --
> Matthew Nicholson
> Research Computing Specialist
> Harvard FAS Research Computing
> [email protected]
>
> On Wed, Jul 10, 2013 at 3:18 AM, Justin Clift <[email protected]> wrote:
>> Hi guys,
>>
>> As an FYI, from discussion on gluster-devel IRC yesterday, the RDMA code
>> still isn't in a good enough state for production usage with 3.4.0. :(
>>
>> There are still outstanding bugs with it, and I'm working to make the
>> Gluster Test Framework able to work with RDMA so we can help shake out
>> more of them:
>>
>> http://www.gluster.org/community/documentation/index.php/Using_the_Gluster_Test_Framework
>>
>> Hopefully RDMA will be ready for 3.4.1, but don't hold me to that at
>> this stage.
>> :)
>>
>> Regards and best wishes,
>>
>> Justin Clift
>>
>> On 09/07/2013, at 8:36 PM, Ryan Aydelott wrote:
>> > Matthew,
>> >
>> > Personally - I have experienced this same problem (even with the mount
>> > being something.rdma). Running 3.4beta4, if I mounted a volume via RDMA
>> > that also had TCP configured as a transport option (which you obviously
>> > do, based on the mounts you gave below), if there is ANY issue with RDMA
>> > not working, the mount will silently fall back to TCP. This problem is
>> > described here: https://bugzilla.redhat.com/show_bug.cgi?id=982757
>> >
>> > The way to test for this behavior is to create a new volume specifying
>> > ONLY RDMA as the transport. If you mount this and your RDMA is broken
>> > for whatever reason, it will simply fail to mount.
>> >
>> > Assuming this test fails, I would then tail the logs for the volume to
>> > get a hint of what's going on. In my case there was an RDMA_CM kernel
>> > module that was not loaded, which started to matter as of 3.4beta2 IIRC,
>> > as they did a complete rewrite for this based on poor performance in
>> > prior releases. The clue in my volume log file was "no such file or
>> > directory" preceded by an rdma_cm.
>> >
>> > Hope that helps!
>> >
>> > -ryan
>> >
>> > On Jul 9, 2013, at 2:03 PM, Matthew Nicholson
>> > <[email protected]> wrote:
>> >
>> >> Hey guys,
>> >>
>> >> So, we're testing Gluster RDMA storage, and are having some issues.
>> >> Things are working... just not as we expected them to. There isn't a
>> >> whole lot in the way of docs that I've found for gluster rdma, aside
>> >> from basically "install gluster-rdma", create a volume with
>> >> transport=rdma, and mount w/ transport=rdma...
>> >>
>> >> I've done that... and the IB fabric is known to be good... however, a
>> >> volume created with transport=rdma,tcp and mounted w/ transport=rdma
>> >> still seems to go over tcp?
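[Editor's note: Ryan's RDMA-only check above amounts to something like the following sketch. The volume name `rdmatest`, brick path, mount point, and log path are hypothetical stand-ins, not taken from the thread; adjust to your own hosts.]

```shell
# Create and start a volume with ONLY rdma as the transport; with no tcp
# fallback available, a broken RDMA path makes the mount fail outright
# instead of silently falling back to tcp.
gluster volume create rdmatest transport rdma \
    holyscratch01-ib:/holyscratch01/rdmatest
gluster volume start rdmatest

mkdir -p /mnt/rdmatest
mount -t glusterfs -o transport=rdma holyscratch01-ib:/rdmatest /mnt/rdmatest

# If the mount fails, check the client volume log for rdma_cm errors
# ("no such file or directory" preceded by rdma_cm, per Ryan's case),
# and confirm the rdma_cm kernel module is loaded.
tail -n 50 /var/log/glusterfs/mnt-rdmatest.log
lsmod | grep rdma_cm || modprobe rdma_cm
```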
>> >> A little more info about the setup:
>> >>
>> >> We've got 10 storage nodes/bricks, each of which has a single 1GbE NIC
>> >> and an FDR IB port. Likewise for the test clients. The 1GbE NIC is for
>> >> management only, and all of the systems on this fabric are configured
>> >> with IPoIB, so there is an eth0 and an ib0 on each node.
>> >>
>> >> All storage nodes are peered using the ib0 interface, i.e.:
>> >>
>> >> gluster peer probe storage_node01-ib
>> >> etc.
>> >>
>> >> That's all well and good.
>> >>
>> >> The volume was created with:
>> >>
>> >> gluster volume create holyscratch transport rdma,tcp \
>> >>     holyscratch01-ib:/holyscratch01/brick
>> >> for i in `seq -w 2 10` ; do gluster volume add-brick holyscratch \
>> >>     holyscratch${i}-ib:/holyscratch${i}/brick; done
>> >>
>> >> yielding:
>> >>
>> >> Volume Name: holyscratch
>> >> Type: Distribute
>> >> Volume ID: 788e74dc-6ae2-4aa5-8252-2f30262f0141
>> >> Status: Started
>> >> Number of Bricks: 10
>> >> Transport-type: tcp,rdma
>> >> Bricks:
>> >> Brick1: holyscratch01-ib:/holyscratch01/brick
>> >> Brick2: holyscratch02-ib:/holyscratch02/brick
>> >> Brick3: holyscratch03-ib:/holyscratch03/brick
>> >> Brick4: holyscratch04-ib:/holyscratch04/brick
>> >> Brick5: holyscratch05-ib:/holyscratch05/brick
>> >> Brick6: holyscratch06-ib:/holyscratch06/brick
>> >> Brick7: holyscratch07-ib:/holyscratch07/brick
>> >> Brick8: holyscratch08-ib:/holyscratch08/brick
>> >> Brick9: holyscratch09-ib:/holyscratch09/brick
>> >> Brick10: holyscratch10-ib:/holyscratch10/brick
>> >> Options Reconfigured:
>> >> nfs.disable: on
>> >>
>> >> For testing, we wanted to see how rdma stacked up vs tcp using IPoIB,
>> >> so we mounted it like:
>> >>
>> >> [root@holy2a01202 holyscratch.tcp]# df -h | grep holyscratch
>> >> holyscratch:/holyscratch
>> >>                273T  4.1T  269T   2% /n/holyscratch.tcp
>> >> holyscratch:/holyscratch.rdma
>> >>                273T  4.1T  269T   2% /n/holyscratch.rdma
>> >>
>> >> so, 2 mounts, same volume, different transports.
>> >> fstab looks like:
>> >>
>> >> holyscratch:/holyscratch /n/holyscratch.tcp glusterfs
>> >>     transport=tcp,fetch-attempts=10,gid-timeout=2,acl,_netdev 0 0
>> >> holyscratch:/holyscratch /n/holyscratch.rdma glusterfs
>> >>     transport=rdma,fetch-attempts=10,gid-timeout=2,acl,_netdev 0 0
>> >>
>> >> where holyscratch is an RRDNS entry covering all the IPoIB interfaces,
>> >> used for fetching the volfile (something which, it seems, just like
>> >> peering, MUST be tcp?).
>> >>
>> >> But, again, when running just dumb, dumb, dumb tests (160 threads of dd
>> >> over 8 nodes, with each thread writing 64GB, so a 10TB throughput
>> >> test), I'm seeing all the traffic on the IPoIB interface for both the
>> >> RDMA and TCP transports... when I really shouldn't be seeing ANY tcp
>> >> traffic on the IPoIB interface when using RDMA as a transport, aside
>> >> from volfile fetches/management... right? As a result, in early tests
>> >> (the bigger 10TB ones are running now), the tcp and rdma speeds were
>> >> basically the same... when I would expect the RDMA one to be at least
>> >> slightly faster.
>> >>
>> >> Oh, and this is all 3.4beta4, on both the clients and storage nodes.
>> >>
>> >> So, I guess my questions are:
>> >>
>> >> Is this expected/normal?
>> >> Is peering/volfile fetching always tcp based?
>> >> How should one peer nodes in an RDMA setup?
>> >> Should this be tried with only RDMA as a transport on the volume?
>> >> Are there more detailed docs for RDMA gluster coming w/ the 3.4 release?
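[Editor's note: a minimal, runnable sketch of the kind of parallel-dd throughput test described above. The mount point, thread count, and per-thread size are scaled-down placeholders (the real run was 160 threads x 64GB against /n/holyscratch.{tcp,rdma}); override them via environment variables for a real run.]

```shell
#!/bin/sh
# Parallel dd writers: THREADS streams of zeros, one file per thread,
# all under MOUNT. Aggregate GB/s = total bytes written / wall time.
MOUNT=${MOUNT:-/tmp/ddtest}   # e.g. /n/holyscratch.rdma in the real test
THREADS=${THREADS:-4}         # 160 in the real test (20 per client node)
COUNT=${COUNT:-16}            # MiB per thread; 65536 for 64GB threads

mkdir -p "$MOUNT"
for i in $(seq 1 "$THREADS"); do
    # conv=fsync makes each writer flush to the bricks before exiting,
    # so elapsed wall time reflects actual storage throughput.
    dd if=/dev/zero of="$MOUNT/ddfile.$i" bs=1M count="$COUNT" \
       conv=fsync 2>/dev/null &
done
wait
ls -lh "$MOUNT"
```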
>> >>
>> >> --
>> >> Matthew Nicholson
>> >> Research Computing Specialist
>> >> Harvard FAS Research Computing
>> >> [email protected]
>> >>
>> >> _______________________________________________
>> >> Gluster-users mailing list
>> >> [email protected]
>> >> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>
>> --
>> Open Source and Standards @ Red Hat
>>
>> twitter.com/realjustinclift
