Hi All,

As I pointed out earlier, for the rdma protocol we need to register the memory
used during rdma reads and writes with the rdma device, and this registration is
a costly operation. To avoid registering memory in the i/o path, we came up with
two solutions.
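
For context, the sketch below shows roughly what a per-i/o registration with
libibverbs looks like; the helper name register_io_buffer and the choice of
access flags are just for illustration, not the actual transport code:

    #include <infiniband/verbs.h>

    /* Registering (and later deregistering) a buffer like this for every
     * read/write request is the per-i/o cost we want to avoid. */
    static struct ibv_mr *
    register_io_buffer (struct ibv_pd *pd, void *buf, size_t len)
    {
            return ibv_reg_mr (pd, buf, len,
                               IBV_ACCESS_LOCAL_WRITE |
                               IBV_ACCESS_REMOTE_READ |
                               IBV_ACCESS_REMOTE_WRITE);
    }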

1) Use a separate pre-registered iobuf_pool for rdma. This approach needs an
extra level of copying in rdma for each read/write request, i.e., in the rdma
code we have to copy the content of the memory given by the application into
rdma's pre-registered buffers.

2) Register the default iobuf_pool in glusterfs_ctx with the rdma device during
rdma initialization (a rough sketch follows the option list below). Since the
buffers used for reads and writes are then taken from the default pool, we need
neither per-request registration nor copying. But a problem comes up when the
io-cache translator is turned on: for each page fault, io-cache takes a ref on
the iobuf of the response buffer in order to cache it, so all of the
pre-allocated buffers get pinned by io-cache very soon. Eventually, new requests
get their iobufs from new iobuf_pools which are not registered with rdma, and we
are back to registering every iobuf.
To address this issue, we can:

             i)   Turn off io-cache.
                  (we chose this for testing)
            ii)   Use a separate buffer for io-cache, and offload data from the
                  default pool to the io-cache buffer.
                  (needs a new thread to do the offloading)
           iii)   Dynamically register each newly created arena with rdma; this
                  requires bringing the libglusterfs code and the transport-layer
                  code together.
                  (will need packaging changes and may introduce hard
                  dependencies on the rdma libs)
            iv)   Increase the default pool size.
                  (will increase the memory footprint of the glusterfs process)
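
As a rough illustration of approach (2) and option (iii), registering a single
arena of the default pool with the rdma device could look something like this.
The helper name is made up, the mem_base/arena_size fields follow libglusterfs'
iobuf.h but may differ between versions, and error handling is omitted:

    #include <infiniband/verbs.h>
    #include "iobuf.h"   /* struct iobuf_arena from libglusterfs (assumed layout) */

    /* Called once per arena at rdma init time, or from an arena-creation
     * hook for option (iii), instead of registering each iobuf in the
     * i/o path. */
    static struct ibv_mr *
    rdma_register_arena (struct ibv_pd *pd, struct iobuf_arena *arena)
    {
            return ibv_reg_mr (pd, arena->mem_base, arena->arena_size,
                               IBV_ACCESS_LOCAL_WRITE |
                               IBV_ACCESS_REMOTE_READ |
                               IBV_ACCESS_REMOTE_WRITE);
    }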

We implemented two of these approaches, (1) and (2.i), to get some performance
numbers. The setup was a 4 x 2 distributed-replicated volume using ram disks as
bricks to avoid a hard-disk bottleneck. The numbers are attached to this mail.


Please provide your thoughts on these approaches.

Regards
Rafi KC


         Separate buffer for rdma (1)      No change                          Register default iobuf pool (2.i)
         write   read    io-cache off      write   read    io-cache off       write   read    io-cache off
1        373     527     656               343     483     532                446     512     696
2        380     528     668               347     485     540                426     525     715
3        376     527     594               346     482     540                422     526     720
4        381     533     597               348     484     540                413     526     710
5        372     527     479               347     482     538                422     519     719
Average  376.4   528.4   598.8             346.2   483.2   538                425.8   521.6   712

Note: results vary between runs.
                                                                
commands:
  read:   echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000
  write:  echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync

                                                        
vol info:
Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable    