On 2017-11-27 15:02, German Anders wrote:

> Hi All, 
> 
> I have a performance question. We recently installed a brand-new Ceph 
> cluster with all-NVMe disks, running ceph version 12.2.0 with BlueStore. 
> The cluster back-end uses a bonded IPoIB link (active/passive), and the 
> front-end uses an active/active bond (20GbE) to communicate with the 
> clients. 
> 
> The cluster configuration is the following: 
> 
> MON NODES: 
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14  
> 3x 1U servers: 
> 2x Intel Xeon E5-2630v4 @2.2Ghz 
> 128G RAM 
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS) 
> 2x 82599ES 10-Gigabit SFI/SFP+ Network Connection 
> 
> OSD NODES: 
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14 
> 4x 2U servers: 
> 2x Intel Xeon E5-2640v4 @2.4Ghz 
> 128G RAM 
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS) 
> 1x Ethernet Controller 10G X550T 
> 1x 82599ES 10-Gigabit SFI/SFP+ Network Connection 
> 12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons 
> 1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port) 
> 
> Here's the tree: 
> 
> ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF 
> -7       48.00000 root root 
> -5       24.00000     rack rack1 
> -1       12.00000         node cpn01 
> 0  nvme  1.00000             osd.0      up  1.00000 1.00000 
> 1  nvme  1.00000             osd.1      up  1.00000 1.00000 
> 2  nvme  1.00000             osd.2      up  1.00000 1.00000 
> 3  nvme  1.00000             osd.3      up  1.00000 1.00000 
> 4  nvme  1.00000             osd.4      up  1.00000 1.00000 
> 5  nvme  1.00000             osd.5      up  1.00000 1.00000 
> 6  nvme  1.00000             osd.6      up  1.00000 1.00000 
> 7  nvme  1.00000             osd.7      up  1.00000 1.00000 
> 8  nvme  1.00000             osd.8      up  1.00000 1.00000 
> 9  nvme  1.00000             osd.9      up  1.00000 1.00000 
> 10  nvme  1.00000             osd.10     up  1.00000 1.00000 
> 11  nvme  1.00000             osd.11     up  1.00000 1.00000 
> -3       12.00000         node cpn03 
> 24  nvme  1.00000             osd.24     up  1.00000 1.00000 
> 25  nvme  1.00000             osd.25     up  1.00000 1.00000 
> 26  nvme  1.00000             osd.26     up  1.00000 1.00000 
> 27  nvme  1.00000             osd.27     up  1.00000 1.00000 
> 28  nvme  1.00000             osd.28     up  1.00000 1.00000 
> 29  nvme  1.00000             osd.29     up  1.00000 1.00000 
> 30  nvme  1.00000             osd.30     up  1.00000 1.00000 
> 31  nvme  1.00000             osd.31     up  1.00000 1.00000 
> 32  nvme  1.00000             osd.32     up  1.00000 1.00000 
> 33  nvme  1.00000             osd.33     up  1.00000 1.00000 
> 34  nvme  1.00000             osd.34     up  1.00000 1.00000 
> 35  nvme  1.00000             osd.35     up  1.00000 1.00000 
> -6       24.00000     rack rack2 
> -2       12.00000         node cpn02 
> 12  nvme  1.00000             osd.12     up  1.00000 1.00000 
> 13  nvme  1.00000             osd.13     up  1.00000 1.00000 
> 14  nvme  1.00000             osd.14     up  1.00000 1.00000 
> 15  nvme  1.00000             osd.15     up  1.00000 1.00000 
> 16  nvme  1.00000             osd.16     up  1.00000 1.00000 
> 17  nvme  1.00000             osd.17     up  1.00000 1.00000 
> 18  nvme  1.00000             osd.18     up  1.00000 1.00000 
> 19  nvme  1.00000             osd.19     up  1.00000 1.00000 
> 20  nvme  1.00000             osd.20     up  1.00000 1.00000 
> 21  nvme  1.00000             osd.21     up  1.00000 1.00000 
> 22  nvme  1.00000             osd.22     up  1.00000 1.00000 
> 23  nvme  1.00000             osd.23     up  1.00000 1.00000 
> -4       12.00000         node cpn04 
> 36  nvme  1.00000             osd.36     up  1.00000 1.00000 
> 37  nvme  1.00000             osd.37     up  1.00000 1.00000 
> 38  nvme  1.00000             osd.38     up  1.00000 1.00000 
> 39  nvme  1.00000             osd.39     up  1.00000 1.00000 
> 40  nvme  1.00000             osd.40     up  1.00000 1.00000 
> 41  nvme  1.00000             osd.41     up  1.00000 1.00000 
> 42  nvme  1.00000             osd.42     up  1.00000 1.00000 
> 43  nvme  1.00000             osd.43     up  1.00000 1.00000 
> 44  nvme  1.00000             osd.44     up  1.00000 1.00000 
> 45  nvme  1.00000             osd.45     up  1.00000 1.00000 
> 46  nvme  1.00000             osd.46     up  1.00000 1.00000 
> 47  nvme  1.00000             osd.47     up  1.00000 1.00000 
> 
> The disk partition of one of the OSD nodes: 
> 
> NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT 
> nvme6n1                259:1    0   1.1T  0 disk 
> ├─nvme6n1p2            259:15   0   1.1T  0 part 
> └─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6 
> nvme9n1                259:0    0   1.1T  0 disk 
> ├─nvme9n1p2            259:8    0   1.1T  0 part 
> └─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9 
> sdb                      8:16   0 139.8G  0 disk 
> └─sdb1                   8:17   0 139.8G  0 part 
> └─md0                  9:0    0 139.6G  0 raid1 
> ├─md0p2            259:31   0     1K  0 md 
> ├─md0p5            259:32   0 139.1G  0 md 
> │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP] 
> │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   / 
> └─md0p1            259:30   0 486.3M  0 md    /boot 
> nvme11n1               259:2    0   1.1T  0 disk 
> ├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11 
> └─nvme11n1p2           259:14   0   1.1T  0 part 
> nvme2n1                259:6    0   1.1T  0 disk 
> ├─nvme2n1p2            259:21   0   1.1T  0 part 
> └─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2 
> nvme5n1                259:3    0   1.1T  0 disk 
> ├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5 
> └─nvme5n1p2            259:10   0   1.1T  0 part 
> nvme8n1                259:24   0   1.1T  0 disk 
> ├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8 
> └─nvme8n1p2            259:28   0   1.1T  0 part 
> nvme10n1               259:11   0   1.1T  0 disk 
> ├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10 
> └─nvme10n1p2           259:23   0   1.1T  0 part 
> nvme1n1                259:33   0   1.1T  0 disk 
> ├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1 
> └─nvme1n1p2            259:35   0   1.1T  0 part 
> nvme4n1                259:5    0   1.1T  0 disk 
> ├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4 
> └─nvme4n1p2            259:19   0   1.1T  0 part 
> nvme7n1                259:25   0   1.1T  0 disk 
> ├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7 
> └─nvme7n1p2            259:29   0   1.1T  0 part 
> sda                      8:0    0 139.8G  0 disk 
> └─sda1                   8:1    0 139.8G  0 part 
> └─md0                  9:0    0 139.6G  0 raid1 
> ├─md0p2            259:31   0     1K  0 md 
> ├─md0p5            259:32   0 139.1G  0 md 
> │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP] 
> │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   / 
> └─md0p1            259:30   0 486.3M  0 md    /boot 
> nvme0n1                259:36   0   1.1T  0 disk 
> ├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0 
> └─nvme0n1p2            259:38   0   1.1T  0 part 
> nvme3n1                259:4    0   1.1T  0 disk 
> ├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3 
> └─nvme3n1p2            259:17   0   1.1T  0 part 
> 
> For the disk scheduler we're using [kyber]; for read_ahead_kb we tried 
> several values (0, 128 and 2048); rq_affinity is set to 2 and the 
> rotational parameter to 0. 
> We've also set the CPU governor to performance on all cores and tuned 
> some sysctl parameters: 
> 
> # for Ceph 
> net.ipv4.ip_forward=0 
> net.ipv4.conf.default.rp_filter=1 
> kernel.sysrq=0 
> kernel.core_uses_pid=1 
> net.ipv4.tcp_syncookies=0 
> #net.netfilter.nf_conntrack_max=2621440 
> #net.netfilter.nf_conntrack_tcp_timeout_established = 1800 
> # disable netfilter on bridges 
> #net.bridge.bridge-nf-call-ip6tables = 0 
> #net.bridge.bridge-nf-call-iptables = 0 
> #net.bridge.bridge-nf-call-arptables = 0 
> vm.min_free_kbytes=1000000 
> 
> # Controls the default maximum size of a message queue, in bytes 
> kernel.msgmnb = 65536 
> 
> # Controls the maximum size of a message, in bytes 
> kernel.msgmax = 65536 
> 
> # Controls the maximum shared segment size, in bytes 
> kernel.shmmax = 68719476736 
> 
> # Controls the total amount of shared memory, in pages 
> kernel.shmall = 4294967296 
> 
> The ceph.conf file is: 
> 
> ... 
> 
> osd_pool_default_size = 3 
> osd_pool_default_min_size = 2 
> osd_pool_default_pg_num = 1600 
> osd_pool_default_pgp_num = 1600 
> 
> debug_crush = 1/1 
> debug_buffer = 0/1 
> debug_timer = 0/0 
> debug_filer = 0/1 
> debug_objecter = 0/1 
> debug_rados = 0/5 
> debug_rbd = 0/5 
> debug_ms = 0/5 
> debug_throttle = 1/1 
> 
> debug_journaler = 0/0 
> debug_objectcatcher = 0/0 
> debug_client = 0/0 
> debug_osd = 0/0 
> debug_optracker = 0/0 
> debug_objclass = 0/0 
> debug_journal = 0/0 
> debug_filestore = 0/0 
> debug_mon = 0/0 
> debug_paxos = 0/0 
> 
> osd_crush_chooseleaf_type = 0 
> filestore_xattr_use_omap = true 
> 
> rbd_cache = true 
> mon_compact_on_trim = false 
> 
> [osd] 
> osd_crush_update_on_start = false 
> 
> [client] 
> rbd_cache = true 
> rbd_cache_writethrough_until_flush = true 
> rbd_default_features = 1 
> admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok 
> log_file = /var/log/ceph/ 
> 
> The cluster has two production pools: one for openstack (volumes) with a 
> replication factor of 3, and another for the database (db) with a 
> replication factor of 2. The DBA team has performed several tests with an 
> RBD volume mounted on the DB server. The DB server has the following 
> configuration: 
> 
> OS: CentOS 6.9 | kernel 4.14.1 
> DB: MySQL 
> ProLiant BL685c G7 
> 4x AMD Opteron Processor 6376 (total of 64 cores) 
> 128G RAM 
> 1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) 
> with 3 vlans 
> 
> We also did some tests with SYSBENCH on different storage types: 
> 
> disk           tps     qps       latency (ms) 95th pct 
> Local SSD      261,28  5.225,61   5,18 
> Ceph NVMe       95,18  1.903,53  12,3 
> Pure Storage   196,49  3.929,71   6,32 
> NetApp FAS     189,83  3.796,59   6,67 
> EMC VMAX       196,14  3.922,82   6,32 
> 
> Is there any specific tuning that I can apply to the ceph cluster, in order 
> to improve those numbers? Or are those numbers ok for the type and size of 
> the cluster that we have? Any advice would be really appreciated. 
> 
> Thanks, 
> 
> German
> 
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi, 

What is the value of --num-threads (the default is 1)? Ceph performs
better with more threads: 32 or 64.
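
As a rough sketch, a fileio run along these lines would exercise the
cluster with more parallelism (the mount point, file size, and the
legacy sysbench 0.5 option syntax are my assumptions, not taken from
the original tests):

```shell
# Assumption: /mnt/rbd-vol is the RBD-backed filesystem under test.
cd /mnt/rbd-vol

# Lay out the test files once.
sysbench --test=fileio --file-total-size=8G prepare

# Random 16k writes from 32 threads for 120 seconds;
# --max-requests=0 means "no request limit, run until --max-time".
sysbench --test=fileio \
    --file-total-size=8G \
    --file-test-mode=rndwr \
    --file-block-size=16384 \
    --num-threads=32 \
    --max-time=120 \
    --max-requests=0 \
    run

# Remove the test files afterwards.
sysbench --test=fileio --file-total-size=8G cleanup
```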
What are the values of --file-block-size (default 16k) and
--file-test-mode? With the sequential modes (seqwr/seqrd) you will keep
hitting the same OSD, so try the random modes (rndrd/rndwr) instead, or
better, use an rbd stripe size of 16kb (the default rbd stripe is 4M).
RBD striping is well suited to the small-block sequential I/O pattern
typical of databases. 
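
For example, something like the following (pool name, image name, and
sizes are placeholders, not your setup; a sketch, not a recipe):

```shell
# Create an image whose 16k stripe unit spreads consecutive small
# writes across 16 different 4M objects, and hence across OSDs,
# instead of filling one 4M object at a time.
rbd create dbpool/mysql-data \
    --size 500G \
    --object-size 4M \
    --stripe-unit 16K \
    --stripe-count 16

# Verify the striping parameters took effect.
rbd info dbpool/mysql-data
```

Note that non-default ("fancy") striping is a librbd feature; the
kernel rbd client does not support it, so such an image would need to
be attached via librbd (QEMU, rbd-nbd, etc.).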

/Maged
