You are aware of this:
https://yourcmc.ru/wiki/Ceph_performance
I am seeing these results with SSDs and a 2.2 GHz Xeon, with no CPU
state/frequency/governor optimization, so your results with HDDs look
quite OK to me.
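(For reference, pinning the CPUs would look roughly like this on each
node; this is a sketch assuming the cpupower tool from linux-tools is
installed, and it is exactly what I have NOT done here:

cpupower frequency-set -g performance   # pin the frequency governor
cpupower idle-set -D 0                  # disable C-states with >0 us wakeup latency
)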
[@c01 ~]# rados -p rbd.ssd bench 30 write
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304
for up to 30 seconds or 0 objects
Object prefix: benchmark_data_c01_2752661
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16       162       146   583.839       584    0.0807733    0.106959
    2      16       347       331   661.868       740     0.052621   0.0943461
    3      16       525       509   678.552       712    0.0493101   0.0934826
    4      16       676       660   659.897       604     0.107205   0.0958496
...
Total time run: 30.0622
Total writes made: 4454
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 592.638
Stddev Bandwidth: 65.0681
Max bandwidth (MB/sec): 740
Min bandwidth (MB/sec): 440
Average IOPS: 148
Stddev IOPS: 16.267
Max IOPS: 185
Min IOPS: 110
Average Latency(s): 0.107988
Stddev Latency(s): 0.0610883
Max latency(s): 0.452039
Min latency(s): 0.0209312
Cleaning up (deleting benchmark objects)
Removed 4454 objects
Clean up completed and total clean up time :0.732456
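(For the read path, the usual counterpart is a write bench run with
--no-cleanup followed by "rados -p rbd.ssd bench 30 seq" against the
same pool.)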
> Subject: [ceph-users] CEPH 16.2.x: disappointing I/O performance
>
> Hi,
>
> I built a Ceph 16.2.x cluster with relatively fast and modern hardware,
> and its performance is kind of disappointing. I would very much
> appreciate advice and/or pointers :-)
>
> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>
> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> 384 GB RAM
> 2 x boot drives
> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> 9 x Toshiba MG06SCA10TE 10 TB HDDs, write cache off (storage tier)
> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>
> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel;
> apparmor is disabled and energy-saving features are disabled. The
> network between the Ceph nodes is 40G, the Ceph access network is 40G,
> and the average latencies are < 0.15 ms. I've personally tested the
> network for throughput, latency and loss, and can tell that it's
> operating as expected and doesn't exhibit any issues at idle or under
> load.
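> (Checks along these lines will do, assuming iperf3 is installed;
> <ceph-node> is a placeholder:
>
> iperf3 -c <ceph-node> -P 4 -t 30    # throughput, 4 parallel streams
> ping -i 0.01 -c 1000 <ceph-node>    # latency and packet loss
> )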
>
> The Ceph cluster is set up with 2 storage classes, NVME and HDD. The 2
> smaller NVME drives in each node are used for DB/WAL, with each HDD
> allocated its own DB/WAL slice on NVME. ceph osd tree output:
>
> ID   CLASS  WEIGHT     TYPE NAME           STATUS  REWEIGHT  PRI-AFF
>  -1         288.37488  root default
> -13         288.37488      datacenter ste
> -14         288.37488          rack rack01
>  -7          96.12495              host ceph01
>   0   hdd     9.38680                  osd.0      up   1.00000  1.00000
>   1   hdd     9.38680                  osd.1      up   1.00000  1.00000
>   2   hdd     9.38680                  osd.2      up   1.00000  1.00000
>   3   hdd     9.38680                  osd.3      up   1.00000  1.00000
>   4   hdd     9.38680                  osd.4      up   1.00000  1.00000
>   5   hdd     9.38680                  osd.5      up   1.00000  1.00000
>   6   hdd     9.38680                  osd.6      up   1.00000  1.00000
>   7   hdd     9.38680                  osd.7      up   1.00000  1.00000
>   8   hdd     9.38680                  osd.8      up   1.00000  1.00000
>   9   nvme    5.82190                  osd.9      up   1.00000  1.00000
>  10   nvme    5.82190                  osd.10     up   1.00000  1.00000
> -10          96.12495              host ceph02
>  11   hdd     9.38680                  osd.11     up   1.00000  1.00000
>  12   hdd     9.38680                  osd.12     up   1.00000  1.00000
>  13   hdd     9.38680                  osd.13     up   1.00000  1.00000
>  14   hdd     9.38680                  osd.14     up   1.00000  1.00000
>  15   hdd     9.38680                  osd.15     up   1.00000  1.00000
>  16   hdd     9.38680                  osd.16     up   1.00000  1.00000
>  17   hdd     9.38680                  osd.17     up   1.00000  1.00000
>  18   hdd     9.38680                  osd.18     up   1.00000  1.00000
>  19   hdd     9.38680                  osd.19     up   1.00000  1.00000
>  20   nvme    5.82190                  osd.20     up   1.00000  1.00000
>  21   nvme    5.82190                  osd.21     up   1.00000  1.00000
>  -3          96.12495              host ceph03
>  22   hdd     9.38680                  osd.22     up   1.00000  1.00000
>  23   hdd     9.38680                  osd.23     up   1.00000  1.00000
>  24   hdd     9.38680                  osd.24     up   1.00000  1.00000
>  25   hdd     9.38680                  osd.25     up   1.00000  1.00000
>  26   hdd     9.38680                  osd.26     up   1.00000  1.00000
>  27   hdd     9.38680                  osd.27     up   1.00000  1.00000
>  28   hdd     9.38680                  osd.28     up   1.00000  1.00000
>  29   hdd     9.38680                  osd.29     up   1.00000  1.00000
>  30   hdd     9.38680                  osd.30     up   1.00000  1.00000
>  31   nvme    5.82190                  osd.31     up   1.00000  1.00000
>  32   nvme    5.82190                  osd.32     up   1.00000  1.00000
>
> ceph df:
>
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
> hdd    253 TiB  241 TiB  13 TiB  13 TiB    5.00
> nvme   35 TiB   35 TiB   82 GiB  82 GiB    0.23
> TOTAL  288 TiB  276 TiB  13 TiB  13 TiB    4.42
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> images                 12  256  24 GiB   3.15k    73 GiB   0.03     76 TiB
> volumes                13  256  839 GiB  232.16k  2.5 TiB  1.07     76 TiB
> backups                14  256  31 GiB   8.56k    94 GiB   0.04     76 TiB
> vms                    15  256  752 GiB  198.80k  2.2 TiB  0.96     76 TiB
> device_health_metrics  16  32   35 MiB   39       106 MiB  0        76 TiB
> volumes-nvme           17  256  28 GiB   7.21k    81 GiB   0.24     11 TiB
> ec-volumes-meta        18  256  27 KiB   4        92 KiB   0        76 TiB
> ec-volumes-data        19  256  8 KiB    1        12 KiB   0       152 TiB
>
> Please disregard the ec-pools, as they're not currently in use. All
> other pools are configured with min_size=2, size=3. All pools are bound
> to HDD storage except for 'volumes-nvme', which is bound to NVME. The
> number of PGs was increased recently, as with the autoscaler I was
> getting a very uneven PG distribution across devices, and we're
> expecting to add 3 more nodes of exactly the same configuration in the
> coming weeks. I have to emphasize that I tested different PG counts and
> they didn't have a noticeable impact on cluster performance.
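> The class binding itself is done with device-class CRUSH rules, roughly
> like this (the rule names here are illustrative, not necessarily the
> exact ones we used):
>
> ceph osd crush rule create-replicated replicated-hdd default host hdd
> ceph osd crush rule create-replicated replicated-nvme default host nvme
> ceph osd pool set volumes crush_rule replicated-hdd
> ceph osd pool set volumes-nvme crush_rule replicated-nvme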
>
> The main issue is that this beautiful cluster isn't very fast. When I
> test against the 'volumes' pool, residing on the HDD storage class
> (HDDs with DB/WAL on NVME), I get unexpectedly low throughput numbers:
>
> > rados -p volumes bench 30 write --no-cleanup
> ...
> Total time run: 30.3078
> Total writes made: 3731
> Write size: 4194304
> Object size: 4194304
> Bandwidth (MB/sec): 492.415
> Stddev Bandwidth: 161.777
> Max bandwidth (MB/sec): 820
> Min bandwidth (MB/sec): 204
> Average IOPS: 123
> Stddev IOPS: 40.4442
> Max IOPS: 205
> Min IOPS: 51
> Average Latency(s): 0.129115
> Stddev Latency(s): 0.143881
> Max latency(s): 1.35669
> Min latency(s): 0.0228179
>
> > rados -p volumes bench 30 seq --no-cleanup
> ...
> Total time run: 14.7272
> Total reads made: 3731
> Read size: 4194304
> Object size: 4194304
> Bandwidth (MB/sec): 1013.36
> Average IOPS: 253
> Stddev IOPS: 63.8709
> Max IOPS: 323
> Min IOPS: 91
> Average Latency(s): 0.0625202
> Max latency(s): 0.551629
> Min latency(s): 0.010683
>
> On average, I get around 550 MB/s writes and 800 MB/s reads with 16
> threads and 4MB blocks. The numbers don't look fantastic for this
> hardware. I can actually push over 8 GB/s of read throughput with fio,
> 16 threads and 4MB blocks from an RBD client (a KVM Linux VM) connected
> over a low-latency 40G network, probably hitting some OSD caches there:
>
> READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s),
> io=501GiB (538GB), run=60001-60153msec
> Disk stats (read/write):
> vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092,
> util=99.48%
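> (The exact fio invocation for this run isn't shown; a command along
> these lines, with parameters assumed from the description above, would
> produce such a run:
>
> fio --name=bigread --filename=/dev/vdc --ioengine=libaio --direct=1 \
>     --rw=read --bs=4M --numjobs=16 --iodepth=1 --runtime=60 --time_based
> )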
>
> The issue manifests when the same client does something closer to
> real-life usage, like a single-threaded write or read with 4KB blocks,
> as if using, for example, an ext4 file system:
>
> > fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1
> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> ...
> Run status group 0 (all jobs):
> WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s),
> io=7694MiB (8067MB), run=64079-64079msec
> Disk stats (read/write):
> vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216,
> util=77.31%
>
> > fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1
> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> ...
> Run status group 0 (all jobs):
> READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s),
> io=3242MiB (3399MB), run=60001-60001msec
> Disk stats (read/write):
> vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336,
> util=99.13%
>
> And this is a total disaster: the IOPS look decent, but the bandwidth
> is unexpectedly low. I just don't understand why a single RBD client
> writes at only 120 MB/s (sometimes slower), and 50 MB/s reads look like
> a bad joke ¯\_(ツ)_/¯
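> (One way to take the guest VM and virtio layers out of the equation is
> fio's rbd engine against a scratch image; this sketch assumes fio was
> built with rbd support, and the image name is made up:
>
> rbd create volumes/fio-test --size 10G
> fio --name=rbd4k --ioengine=rbd --clientname=admin --pool=volumes \
>     --rbdname=fio-test --rw=write --bs=4k --iodepth=1 --numjobs=1 \
>     --runtime=60 --time_based
> )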
>
> When I run these benchmarks, nothing seems to be overloaded: CPU and
> network are barely utilized, and OSD latencies don't show anything
> unusual. Thus I am puzzled by these results, as in my opinion SAS HDDs
> with DB/WAL on NVME drives should produce better I/O bandwidth, both
> for writes and for reads. I mean, I can easily get much better
> performance from a single HDD shared over the network via NFS or iSCSI.
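> (For reference, per-OSD latency can be checked with tools like "ceph
> osd perf", which shows commit/apply latency per OSD, and "ceph tell
> osd.N bench", which measures the raw write speed of a single OSD.)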
>
> I am open to suggestions and would very much appreciate comments and/or
> advice on how to improve the cluster performance.
>
> Best regards,
> Zakhar
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]