Greetings. My cluster consists of 3 nodes. Each node has 4 OSD HDDs with a
capacity of 6 TB each and 1 NVMe for DB/WAL. The network is 2 x 10 Gbps
interfaces in a bond, and I changed some parameters in rc.local to improve
performance. They are below:
#Set network interface buffer size
ethtool -G eno1 rx 4096 tx 4096; ethtool -G eno2 rx 4096 tx 4096
#Set txqueuelen on eno1, eno2, bond0, vmbr0 and the vmbr0 VLAN interfaces
ip link set eno1 txqueuelen 10000
ip link set eno2 txqueuelen 10000
ip link set bond0 txqueuelen 20000
ip link set vmbr0 txqueuelen 20000
ip link set vmbr0.4040 txqueuelen 10000
ip link set vmbr0.4043 txqueuelen 10000
ip link set vmbr0.4045 txqueuelen 10000
ip link set vmbr0.4053 txqueuelen 10000
#Disable segmentation offloads (GSO/GRO/LRO/TSO); keep checksum and scatter-gather offloads on
ethtool -K eno1 gso off gro off lro off tso off
ethtool -K eno2 gso off gro off lro off tso off
ethtool -K eno1 rx on tx on sg on
ethtool -K eno2 rx on tx on sg on
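#Set the number of combined RX/TX queues and enable receive hashing (RSS)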
ethtool -L eno1 combined 8
ethtool -L eno2 combined 8
ethtool -K eno1 rxhash on
ethtool -K eno2 rxhash on
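#Spread the RSS queue interrupts across CPU cores (rss-ladder from the netutils-linux package)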
rss-ladder eno1 0
rss-ladder eno2 1
With the default ceph.conf settings, when testing a 1 GB file with the
SEQ1MQ8T1 profile inside a KVM virtual machine, I get a write throughput of
152 MB/s.
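(For reference, the fio job below is roughly what I assume SEQ1MQ8T1
corresponds to: 1 MiB sequential writes, queue depth 8, 1 thread; the file
path is only an example.)
fio --name=seq1m_q8t1 --filename=/root/fio-test.bin --size=1G --rw=write \
    --bs=1M --iodepth=8 --numjobs=1 --ioengine=libaio --direct=1 \
    --group_reporting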
The bandwidth of each disk, when not used through Ceph, is at least 200 MB/s,
and I have 4 of them per node, 800 MB/s in total. With size=3 each client
write lands on one OSD per node, so that per-node aggregate is the relevant
ceiling. Even if we allow 25% for overhead and the abstraction layers, that
should still be around 600 MB/s. I may be wrong, this is a purely theoretical
estimate.
When monitoring disk load with the nmon utility during the test, I see write
rates of 30 to 50 MB/s per disk on each node.
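To see what a single OSD can do outside the RBD path, I assume the built-in
OSD bench is suitable (by default it writes 1 GB in 4 MB chunks directly to
the OSD):
ceph tell osd.0 bench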
The network cannot be the bottleneck, since the nodes are otherwise empty and
only 1 virtual machine is generating the load.
iperf3 from one node to another shows at least 8 Gbit/s, which is more than
enough to carry the 4.8 Gbit/s (600 MB/s) the test would need.
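The check was along these lines (the 10.50.251.x address is only an example
from the cluster network):
iperf3 -s                            # on the first node
iperf3 -c 10.50.251.2 -t 30 -P 4     # on the second node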
Write caching in the virtual machine is disabled.
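To rule out the VM layer entirely, I assume a rados bench run against a
scratch pool would show the same ceiling; something like (pool name is only an
example):
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
rados -p testpool cleanup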
I started changing the [osd] parameters in ceph.conf. With these changes I
managed to double the read speed, from 195 MB/s to 550 MB/s, but the write
speed did not budge. The question now is not about IOPS but about sequential
write throughput. What needs to be changed? Below is my ceph.conf:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.50.251.0/24
fsid = 7d00b675-2f1e-47ff-a71c-b95d1745bc39
mon_allow_pool_delete = true
mon_host = 10.50.250.1 10.50.250.2 10.50.250.3
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.50.250.0/24
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
#[osd]
# bluestore_compression_mode = none
# bluestore_cache_autotune = true
# bluestore_cache_size = 3221225472
# bluestore_cache_kv_ratio = 0.2
# bluestore_cache_kv_max = 1610612736
# bluestore_min_alloc_size_hdd = 65536
# bluestore_max_alloc_size_hdd = 67108864
# bluestore_min_alloc_size_ssd = 16384
# bluestore_max_alloc_size_ssd = 33554432
# bluestore_throttle_bytes = 268435456
# bluestore_throttle_deferred_bytes = 1073741824
# bluestore_rocksdb_options = write_buffer_size=268435456;max_write_buffer_number=4
# osd_op_num_threads_per_shard = 4
# osd_client_message_cap = 1024
# osd_client_message_size_cap = 536870912
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.nd01]
host = nd01
mds_standby_for_name = pve
[mds.nd02]
host = nd02
mds_standby_for_name = pve
[mds.nd03]
host = nd03
mds_standby_for_name = pve
[mon.nd01]
public_addr = 10.50.250.1
[mon.nd02]
public_addr = 10.50.250.2
[mon.nd03]
public_addr = 10.50.250.3
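For completeness, I assume the runtime-tunable [osd] options can also be set
and verified through the mon config database instead of editing ceph.conf,
for example:
ceph config set osd bluestore_throttle_bytes 268435456
ceph config show osd.0 bluestore_throttle_bytes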
Perhaps I do not quite understand the philosophy, logic and functionality of
Ceph? I am aware that there may be competition for resources and so on, but I
see a clean 50/50 split of write throughput when another VM runs a similar
test in parallel. And at 150 MB/s it is rather scary that at some point the
virtual machines will compete and the write speed may drop to very low values.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]