Rafi, great results, thanks. Your "io-cache off" columns are read tests with
the io-cache translator disabled, correct? Two things jump out at me from your
numbers:

- io-cache translator destroys RDMA read performance. 
- approach 2i) "register iobuf pool" is the best approach.
-- on reads with io-cache off, 32% better than baseline and 21% better than 1) 
"separate buffer" 
-- on writes, 22% better than baseline and 14% better than 1)

Can someone explain to me why the typical Gluster site wants to use the
io-cache translator, given that FUSE now caches file data? Should we just have
it turned off by default at this point? That would buy us time to change the
io-cache implementation to be compatible with RDMA (see option 2ii below).

remaining comments inline

-ben

----- Original Message -----
> From: "Mohammed Rafi K C" <rkavu...@redhat.com>
> To: gluster-devel@gluster.org
> Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Anand Avati" 
> <av...@gluster.org>, "Ben Turner"
> <btur...@redhat.com>, "Ben England" <bengl...@redhat.com>, "Suman Debnath" 
> <sdebn...@redhat.com>
> Sent: Friday, January 23, 2015 7:43:45 AM
> Subject: RDMA: Patch to make use of pre registered memory
> 
> Hi All,
> 
> As I pointed out earlier, for the rdma protocol we need to register the
> memory used during rdma reads and writes with the rdma device, and that
> registration is a costly operation. To avoid registering memory in the i/o
> path, we came up with two solutions.
> 
> 1) Use a separate pre-registered iobuf_pool for rdma. This approach needs an
> extra level of copying in rdma for each read/write request, i.e. we need to
> copy the contents of the memory given by the application into rdma's buffers
> in the rdma code.
> 

copying data defeats the whole point of RDMA, which is to *avoid* copying data. 
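
For readers who haven't touched the verbs API, the cost being discussed is
roughly the call below landing in the I/O path for every request. This is only
an illustrative sketch against libibverbs, not the actual rpc-transport/rdma
code, and the helper name is made up:

    #include <infiniband/verbs.h>
    #include <stddef.h>

    /* Illustrative only: registering a buffer on every read/write means a
     * syscall plus page-pinning per request -- the overhead both proposals
     * try to move out of the I/O path. */
    static struct ibv_mr *
    register_per_request (struct ibv_pd *pd, void *buf, size_t len)
    {
            /* ibv_reg_mr pins the pages and gives the HCA keys for them */
            return ibv_reg_mr (pd, buf, len,
                               IBV_ACCESS_LOCAL_WRITE |
                               IBV_ACCESS_REMOTE_READ |
                               IBV_ACCESS_REMOTE_WRITE);
    }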
  

> 2) Register the default iobuf_pool in glusterfs_ctx with the rdma device
> during rdma initialization. Since the buffers handed out from the default
> pool for read/write are then already registered, we require neither
> registration in the i/o path nor copying.

This makes far more sense to me.
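
To make 2) concrete, here is a rough sketch of what "register the default pool
at init time" could look like. The structure and field names are illustrative
stand-ins, not the real gluster iobuf types, and the sketch assumes the pool's
arenas are reachable as a simple linked list:

    #include <infiniband/verbs.h>
    #include <stddef.h>

    struct demo_arena {                 /* stand-in for an iobuf arena */
            void              *mem_base;
            size_t             arena_size;
            struct ibv_mr     *mr;      /* registration handle kept per arena */
            struct demo_arena *next;
    };

    /* Walk the default pool once at transport init; iobufs handed out later
     * are then already registered and need no work in the I/O path. */
    static int
    register_default_pool (struct ibv_pd *pd, struct demo_arena *arenas)
    {
            for (struct demo_arena *a = arenas; a != NULL; a = a->next) {
                    a->mr = ibv_reg_mr (pd, a->mem_base, a->arena_size,
                                        IBV_ACCESS_LOCAL_WRITE |
                                        IBV_ACCESS_REMOTE_READ |
                                        IBV_ACCESS_REMOTE_WRITE);
                    if (a->mr == NULL)
                            return -1;
            }
            return 0;
    }

Option iii below would make the same ibv_reg_mr call, just lazily, whenever a
new arena is created.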

> But the problem comes when the io-cache translator is turned on: for each
> page fault, io-cache takes a ref on the iobuf of the response buffer in
> order to cache it, so all of the pre-allocated buffers get locked up in
> io-cache very soon. Eventually all new requests get iobufs from newly
> created iobuf_pools which are not registered with rdma, and we are back to
> registering every iobuf. To address this issue, we can:
> 
>              i)  Turn off io-cache.
>                  (we chose this for testing)
>             ii)  Use a separate buffer for io-cache, and offload the copy
>                  from the default pool into the io-cache buffer.
>                  (a new thread does the offloading)


I think this makes sense, because if you get an io-cache translator cache hit,
then you don't need to go out to the network, so io-cache memory doesn't have
to be registered with RDMA.
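
A minimal sketch of what 2ii could amount to, assuming io-cache gets its own
unregistered buffers (every name below is hypothetical): instead of holding a
ref on the registered response iobuf, copy the page into io-cache-private
memory, possibly on a worker thread, and give the rdma iobuf straight back to
the default pool.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: detach a cached page from the registered iobuf by
     * copying it into io-cache-private (unregistered) memory.  In practice
     * the copy would be queued to a worker thread and the destination would
     * come from an io-cache-only pool rather than malloc. */
    static void *
    ioc_detach_page (const void *rdma_buf, size_t len)
    {
            void *copy = malloc (len);
            if (copy != NULL)
                    memcpy (copy, rdma_buf, len);
            return copy;   /* cache the copy; unref the rdma iobuf afterwards */
    }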

>             iii) Dynamically register each newly created arena with rdma;
>                  for this we need to bring the libglusterfs code and the
>                  transport-layer code together.
>                  (will need packaging changes and may introduce hard
>                  dependencies on the rdma libs)
>             iv)  Increase the default pool size.
>                  (will increase the memory footprint of the glusterfs process)
> 

registration with RDMA only makes sense to me when data is going to be 
sent/received over the RDMA network.  Is it hard to tell in advance which 
buffers will need to be transmitted?

> We implemented two of the approaches, (1) and (2i), to get some performance
> numbers. The setup was a 4x2 distributed-replicated volume using ram disks
> as bricks to avoid a hard-disk bottleneck. The numbers are attached to this
> mail.
> 
> 
> Please provide your thoughts on these approaches.
> 
> Regards
> Rafi KC
> 
> 
> 
        Separate buffer for rdma (1)    No change (baseline)            Register default iobuf pool (2i)
Run     write   read    io-cache off    write   read    io-cache off    write   read    io-cache off
1       373     527     656             343     483     532             446     512     696
2       380     528     668             347     485     540             426     525     715
3       376     527     594             346     482     540             422     526     720
4       381     533     597             348     484     540             413     526     710
5       372     527     479             347     482     538             422     519     719
Note: (varying result)
Average 376.4   528.4   598.8           346.2   483.2   538             425.8   521.6   712
                                                                
command read:  echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000
        write: echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync

                                                        
vol info:
Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable    