We recently discovered that our CephFS mount appeared to halt reads while 
writes were being synced to the Ceph cluster, to the point that it was 
affecting applications. 

I also posted this as a Gist with embedded graph images to help illustrate: 
https://gist.github.com/keeperAndy/aa80d41618caa4394e028478f4ad1694

The following is the plain text from the Gist. 

First, details about the host:

````
    $ uname -r
    4.16.13-041613-generic

    $ egrep 'xfs|ceph' /proc/mounts
    192.168.1.115:6789,192.168.1.116:6789,192.168.1.117:6789:/ /cephfs ceph rw,noatime,name=cephfs,secret=<hidden>,rbytes,acl,wsize=16777216 0 0
    /dev/mapper/tst01-lvidmt01 /rbd_xfs xfs rw,relatime,attr2,inode64,logbsize=256k,sunit=512,swidth=1024,noquota 0 0

    $ ceph -v
    ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

    $ cat /proc/net/bonding/bond1
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

    Bonding Mode: adaptive load balancing
    Primary Slave: None
    Currently Active Slave: net6
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 200
    Down Delay (ms): 200

    Slave Interface: net8
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 2
    Permanent HW addr: e4:1d:2d:17:71:e1
    Slave queue ID: 0

    Slave Interface: net6
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 1
    Permanent HW addr: e4:1d:2d:17:71:e0
    Slave queue ID: 0

````

We had CephFS mounted alongside an XFS filesystem built from 16 RBD images 
aggregated under LVM as our storage targets. The link from the host to the 
Ceph cluster is a mode 6 (balance-alb) 2x10GbE bond (bond1 above).
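
For context, a minimal sketch of how such a stack might be assembled. The 
image names, sizes, and stripe geometry below are assumptions; only the VG/LV 
names are taken from the mount output above:

````
    $ for i in $(seq -w 1 16); do
    >   rbd create tst01/img$i --size 1T
    >   rbd map tst01/img$i
    > done
    $ pvcreate /dev/rbd{0..15}
    $ vgcreate tst01 /dev/rbd{0..15}
    $ lvcreate -n lvidmt01 -i 16 -I 256k -l 100%FREE tst01
    $ mkfs.xfs /dev/tst01/lvidmt01
    $ mount /dev/mapper/tst01-lvidmt01 /rbd_xfs
````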

We started capturing network counters from the Ceph cluster connection (bond1) 
on the host using ifstat at its most granular setting of 0.1 (sampling every 
tenth of a second). We then ran various overlapping read and write operations 
in separate shells on the same host to obtain samples of how our different 
means of accessing Ceph handled this. We converted our ifstat output to CSV and 
inserted it into a spreadsheet to visualize the network activity.
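
The capture and a representative overlapping pass looked roughly like the 
following; the file names, sizes, and awk conversion are illustrative rather 
than our exact commands:

````
    # Shell 1: sample bond1 throughput every 0.1s, then convert to CSV
    # (skip ifstat's two header lines; columns are KB/s in and KB/s out)
    $ ifstat -i bond1 0.1 > ifstat_bond1.log
    $ awk 'NR>2 {print NR-2","$1","$2}' ifstat_bond1.log > ifstat_bond1.csv

    # Shell 2: sustained read from CephFS
    $ dd if=/cephfs/readfile of=/dev/null bs=4M

    # Shell 3: overlapping write into CephFS (lands in page cache, then flushes)
    $ dd if=/dev/zero of=/cephfs/writefile bs=4M count=4096
````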

We found that the CephFS kernel mount did indeed appear to pause ongoing reads 
when writes were being flushed from the page cache to the Ceph cluster. 
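
The read pauses lined up with writeback of dirty pages. A hypothetical way to 
watch that flush from a separate shell (not part of our original capture):

````
    $ watch -n 0.1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
````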

We wanted to see if we could make this more pronounced, so we added a 6 Gbit/s 
tc rate limit to the interface and re-ran our tests. This yielded much longer 
pauses in the reads while the writes were flushed more slowly from the page 
cache to the Ceph cluster. 

A more restrictive 2 Gbit/s tc limit produced even longer delays in our reads 
as the writes were synced to the cluster.
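
The throttles were applied with tc; something along these lines reproduces 
them, though the exact qdisc/filter and parameters we used may have differed:

````
    # Limit bond1 egress to 6 Gbit/s (use "rate 2gbit" for the second run)
    $ tc qdisc add dev bond1 root tbf rate 6gbit burst 1m latency 50ms
    # Remove the limit when done
    $ tc qdisc del dev bond1 root
````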

When we tested the same I/O on the RBD-backed XFS file system on the same host, 
we found a very different pattern. The reads seemed to be given priority over 
the write activity, but the writes were only slowed, not halted.

Finally, we tested overlapping SMB client reads and writes against a Samba 
share served by the userspace, libcephfs-based vfs_ceph module. In this case, 
while raw throughput was lower than that of the kernel client, the reads and 
writes did not interrupt each other at all. 
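
The share section of our smb.conf looked roughly like this; the share name and 
paths are assumptions, with "vfs objects = ceph" selecting the libcephfs-backed 
module:

````
    [cephfs]
        path = /
        vfs objects = ceph
        ceph:config_file = /etc/ceph/ceph.conf
        ceph:user_id = samba
        read only = no
````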

Is this expected behavior for the CephFS kernel driver? Can a CephFS kernel 
client really not read from and write to the file system simultaneously?

Thanks,
Andrew Richards
Senior Systems Engineer
Keeper Technology, LLC
