How many nodes make up that volume you were using for testing? Over 100 nodes running QDR/IPoIB with 100 threads, we ran around 60GB/s for reads and somewhere around 40GB/s for writes (IIRC).
On Jul 10, 2013, at 1:49 PM, Matthew Nicholson <[email protected]> wrote:

> Well, first of all, thanks for the responses. The volume WAS failing over to
> tcp just as predicted, though WHY is unclear, as the fabric is known working
> (it has about 28K compute cores on it, all doing heavy MPI testing), and the
> OFED/verbs stack is consistent across all client/storage systems (in fact,
> the OS image is identical).
>
> That's quite sad that RDMA isn't going to make 3.4. We put a good deal of
> hope and effort into planning around 3.4 for this storage system,
> specifically for RDMA support (with warnings to the team that it wasn't
> in/tested for 3.3, and that all we could do was HOPE it would be in 3.4 in
> time for when we want to go live). We're getting "okay" performance out of
> IPoIB right now, and our bottleneck actually seems to be the fabric
> design/layout, as we're peaking at about 4.2GB/s writing 10TB over 160
> threads to this distributed volume.
>
> When it IS ready and in 3.4.1 (hopefully!), having good docs around it, and
> maybe even a simple printf for the tcp failover, would be huge for us.
>
> --
> Matthew Nicholson
> Research Computing Specialist
> Harvard FAS Research Computing
> [email protected]
>
> On Wed, Jul 10, 2013 at 3:18 AM, Justin Clift <[email protected]> wrote:
>> Hi guys,
>>
>> As an FYI, from discussion on gluster-devel IRC yesterday, the RDMA code
>> still isn't in a good enough state for production usage with 3.4.0. :(
>>
>> There are still outstanding bugs with it, and I'm working to make the
>> Gluster Test Framework able to work with RDMA so we can help shake out
>> more of them:
>>
>> http://www.gluster.org/community/documentation/index.php/Using_the_Gluster_Test_Framework
>>
>> Hopefully RDMA will be ready for 3.4.1, but don't hold me to that at
>> this stage.
>> :)
>>
>> Regards and best wishes,
>>
>> Justin Clift
>>
>> On 09/07/2013, at 8:36 PM, Ryan Aydelott wrote:
>> > Matthew,
>> >
>> > Personally - I have experienced this same problem (even with the mount
>> > being something.rdma). Running 3.4beta4, if I mounted a volume via RDMA
>> > that also had TCP configured as a transport option (which you obviously
>> > do, based on the mounts you gave below), if there is ANY issue with RDMA
>> > not working, the mount will silently fall back to TCP. This problem is
>> > described here: https://bugzilla.redhat.com/show_bug.cgi?id=982757
>> >
>> > The way to test for this behavior is to create a new volume specifying
>> > ONLY RDMA as the transport. If you mount this and your RDMA is broken
>> > for whatever reason, it will simply fail to mount.
>> >
>> > Assuming this test fails, I would then tail the logs for the volume to
>> > get a hint of what's going on. In my case there was an RDMA_CM kernel
>> > module that was not loaded, which started to matter as of 3.4beta2 IIRC,
>> > as they did a complete rewrite for this based on poor performance in
>> > prior releases. The clue in my volume log file was "no such file or
>> > directory" preceded by an rdma_cm.
>> >
>> > Hope that helps!
>> >
>> > -ryan
>> >
>> > On Jul 9, 2013, at 2:03 PM, Matthew Nicholson
>> > <[email protected]> wrote:
>> >
>> >> Hey guys,
>> >>
>> >> So, we're testing Gluster RDMA storage, and are having some issues.
>> >> Things are working... just not as we expected them to. There isn't a
>> >> whole lot in the way of docs that I've found for gluster rdma, aside
>> >> from basically "install gluster-rdma", create a volume with
>> >> transport=rdma, and mount w/ transport=rdma...
>> >>
>> >> I've done that... and the IB fabric is known to be good... however, a
>> >> volume created with transport=rdma,tcp and mounted w/ transport=rdma
>> >> still seems to go over tcp?
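[Editor's note: Ryan's RDMA-only check above amounts to something like the following sketch. The volume name `rdmatest`, brick path, mount point, and log path are hypothetical stand-ins, not taken from the thread; adjust to your own hosts.]

```shell
# Create and start a volume with ONLY rdma as the transport; with no tcp
# fallback available, a broken RDMA path makes the mount fail outright
# instead of silently falling back to tcp.
gluster volume create rdmatest transport rdma \
    holyscratch01-ib:/holyscratch01/rdmatest
gluster volume start rdmatest

mkdir -p /mnt/rdmatest
mount -t glusterfs -o transport=rdma holyscratch01-ib:/rdmatest /mnt/rdmatest

# If the mount fails, check the client volume log for rdma_cm errors
# ("no such file or directory" preceded by rdma_cm, per Ryan's case),
# and confirm the rdma_cm kernel module is loaded.
tail -n 50 /var/log/glusterfs/mnt-rdmatest.log
lsmod | grep rdma_cm || modprobe rdma_cm
```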
>> >> A little more info about the setup:
>> >>
>> >> We've got 10 storage nodes/bricks, each of which has a single 1GbE NIC
>> >> and an FDR IB port. Likewise for the test clients. The 1GbE NIC is for
>> >> management only, and all of the systems on this fabric are configured
>> >> with IPoIB, so there is an eth0 and an ib0 on each node.
>> >>
>> >> All storage nodes are peered using the ib0 interface, i.e.:
>> >>
>> >> gluster peer probe storage_node01-ib
>> >> etc.
>> >>
>> >> That's all well and good.
>> >>
>> >> The volume was created with:
>> >>
>> >> gluster volume create holyscratch transport rdma,tcp \
>> >>     holyscratch01-ib:/holyscratch01/brick
>> >> for i in `seq -w 2 10` ; do gluster volume add-brick holyscratch \
>> >>     holyscratch${i}-ib:/holyscratch${i}/brick; done
>> >>
>> >> yielding:
>> >>
>> >> Volume Name: holyscratch
>> >> Type: Distribute
>> >> Volume ID: 788e74dc-6ae2-4aa5-8252-2f30262f0141
>> >> Status: Started
>> >> Number of Bricks: 10
>> >> Transport-type: tcp,rdma
>> >> Bricks:
>> >> Brick1: holyscratch01-ib:/holyscratch01/brick
>> >> Brick2: holyscratch02-ib:/holyscratch02/brick
>> >> Brick3: holyscratch03-ib:/holyscratch03/brick
>> >> Brick4: holyscratch04-ib:/holyscratch04/brick
>> >> Brick5: holyscratch05-ib:/holyscratch05/brick
>> >> Brick6: holyscratch06-ib:/holyscratch06/brick
>> >> Brick7: holyscratch07-ib:/holyscratch07/brick
>> >> Brick8: holyscratch08-ib:/holyscratch08/brick
>> >> Brick9: holyscratch09-ib:/holyscratch09/brick
>> >> Brick10: holyscratch10-ib:/holyscratch10/brick
>> >> Options Reconfigured:
>> >> nfs.disable: on
>> >>
>> >> For testing, we wanted to see how rdma stacked up vs tcp using IPoIB,
>> >> so we mounted it like:
>> >>
>> >> [root@holy2a01202 holyscratch.tcp]# df -h | grep holyscratch
>> >> holyscratch:/holyscratch
>> >>                273T  4.1T  269T   2% /n/holyscratch.tcp
>> >> holyscratch:/holyscratch.rdma
>> >>                273T  4.1T  269T   2% /n/holyscratch.rdma
>> >>
>> >> so, 2 mounts, same volume, different transports.
>> >> fstab looks like:
>> >>
>> >> holyscratch:/holyscratch /n/holyscratch.tcp glusterfs
>> >>     transport=tcp,fetch-attempts=10,gid-timeout=2,acl,_netdev 0 0
>> >> holyscratch:/holyscratch /n/holyscratch.rdma glusterfs
>> >>     transport=rdma,fetch-attempts=10,gid-timeout=2,acl,_netdev 0 0
>> >>
>> >> where holyscratch is an RRDNS entry covering all the IPoIB interfaces,
>> >> used for fetching the volfile (something which, it seems, just like
>> >> peering, MUST be tcp?).
>> >>
>> >> But, again, when running just dumb, dumb, dumb tests (160 threads of dd
>> >> over 8 nodes, with each thread writing 64GB, so a 10TB throughput
>> >> test), I'm seeing all the traffic on the IPoIB interface for both the
>> >> RDMA and TCP transports... when I really shouldn't be seeing ANY tcp
>> >> traffic on the IPoIB interface when using RDMA as a transport, aside
>> >> from volfile fetches/management... right? As a result, in early tests
>> >> (the bigger 10TB ones are running now), the tcp and rdma speeds were
>> >> basically the same... when I would expect the RDMA one to be at least
>> >> slightly faster.
>> >>
>> >> Oh, and this is all 3.4beta4, on both the clients and storage nodes.
>> >>
>> >> So, I guess my questions are:
>> >>
>> >> Is this expected/normal?
>> >> Is peering/volfile fetching always tcp based?
>> >> How should one peer nodes in an RDMA setup?
>> >> Should this be tried with only RDMA as a transport on the volume?
>> >> Are there more detailed docs for RDMA gluster coming w/ the 3.4 release?
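[Editor's note: a minimal, runnable sketch of the kind of parallel-dd throughput test described above. The mount point, thread count, and per-thread size are scaled-down placeholders (the real run was 160 threads x 64GB against /n/holyscratch.{tcp,rdma}); override them via environment variables for a real run.]

```shell
#!/bin/sh
# Parallel dd writers: THREADS streams of zeros, one file per thread,
# all under MOUNT. Aggregate GB/s = total bytes written / wall time.
MOUNT=${MOUNT:-/tmp/ddtest}   # e.g. /n/holyscratch.rdma in the real test
THREADS=${THREADS:-4}         # 160 in the real test (20 per client node)
COUNT=${COUNT:-16}            # MiB per thread; 65536 for 64GB threads

mkdir -p "$MOUNT"
for i in $(seq 1 "$THREADS"); do
    # conv=fsync makes each writer flush to the bricks before exiting,
    # so elapsed wall time reflects actual storage throughput.
    dd if=/dev/zero of="$MOUNT/ddfile.$i" bs=1M count="$COUNT" \
       conv=fsync 2>/dev/null &
done
wait
ls -lh "$MOUNT"
```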
>> >>
>> >> --
>> >> Matthew Nicholson
>> >> Research Computing Specialist
>> >> Harvard FAS Research Computing
>> >> [email protected]
>> >>
>> >> _______________________________________________
>> >> Gluster-users mailing list
>> >> [email protected]
>> >> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>
>> --
>> Open Source and Standards @ Red Hat
>>
>> twitter.com/realjustinclift
