On 2018-01-31 08:14, Manuel Sopena Ballesteros wrote:
> Dear Ceph community,
>
> I have a very small ceph cluster for testing with this configuration:
>
> · 2x compute nodes, each with:
>   · dual-port 25 GbE NIC
>   · 2x CPU sockets (56 cores with hyperthreading)
>   · 10x Intel NVMe DC P3500 drives
>   · 512 GB RAM
>
> One of the nodes is also running as a monitor.
>
> Installation has been done using ceph-ansible.
>
> Ceph version: jewel
>
> Storage engine: filestore
>
> Performance test below:
>
> [root@zeus-59 ceph-block-device]# ceph osd pool ls detail
> pool 0 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 115 flags hashpspool stripe_width 0
> pool 1 'images' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 118 flags hashpspool stripe_width 0
>         removed_snaps [1~3,7~4]
> pool 3 'backups' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 120 flags hashpspool stripe_width 0
> pool 4 'vms' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 122 flags hashpspool stripe_width 0
>         removed_snaps [1~7]
> pool 5 'volumes' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 124 flags hashpspool stripe_width 0
>         removed_snaps [1~3]
> pool 6 'scbench' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 126 flags hashpspool stripe_width 0
> pool 7 'rbdbench' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 128 flags hashpspool stripe_width 0
>         removed_snaps [1~3]
>
> [root@zeus-59 ceph-block-device]# ceph osd tree
> ID WEIGHT   TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 36.17371 root default
> -2 18.08685     host zeus-58
>  0  1.80869         osd.0          up  1.00000          1.00000
>  2  1.80869         osd.2          up  1.00000          1.00000
>  4  1.80869         osd.4          up  1.00000          1.00000
>  6  1.80869         osd.6          up  1.00000          1.00000
>  8  1.80869         osd.8          up  1.00000          1.00000
> 10  1.80869         osd.10         up  1.00000          1.00000
> 12  1.80869         osd.12         up  1.00000          1.00000
> 14  1.80869         osd.14         up  1.00000          1.00000
> 16  1.80869         osd.16         up  1.00000          1.00000
> 18  1.80869         osd.18         up  1.00000          1.00000
> -3 18.08685     host zeus-59
>  1  1.80869         osd.1          up  1.00000          1.00000
>  3  1.80869         osd.3          up  1.00000          1.00000
>  5  1.80869         osd.5          up  1.00000          1.00000
>  7  1.80869         osd.7          up  1.00000          1.00000
>  9  1.80869         osd.9          up  1.00000          1.00000
> 11  1.80869         osd.11         up  1.00000          1.00000
> 13  1.80869         osd.13         up  1.00000          1.00000
> 15  1.80869         osd.15         up  1.00000          1.00000
> 17  1.80869         osd.17         up  1.00000          1.00000
> 19  1.80869         osd.19         up  1.00000          1.00000
>
> [root@zeus-59 ceph-block-device]# ceph status
>     cluster 8e930b6c-455e-4328-872d-cb9f5c0359ae
>      health HEALTH_OK
>      monmap e1: 1 mons at {zeus-59=10.0.32.59:6789/0}
>             election epoch 3, quorum 0 zeus-59
>      osdmap e129: 20 osds: 20 up, 20 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v1166945: 776 pgs, 7 pools, 1183 GB data, 296 kobjects
>             2363 GB used, 34678 GB / 37042 GB avail
>                  775 active+clean
>                    1 active+clean+scrubbing+deep
>
> [root@zeus-59 ceph-block-device]# rados bench -p scbench 10 write --no-cleanup
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_zeus-59.localdomain_2844050
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16       644       628    2511.4      2512    0.0210273    0.025206
>     2      16      1319      1303   2605.49      2700    0.0238678   0.0243974
>     3      16      2003      1987   2648.89      2736    0.0201334   0.0240726
>     4      16      2669      2653   2652.59      2664    0.0258618   0.0240468
>     5      16      3349      3333   2666.01      2720    0.0189464   0.0239484
>     6      16      4026      4010   2672.96      2708      0.02215   0.0238954
>     7      16      4697      4681   2674.49      2684    0.0217258   0.0238887
>     8      16      5358      5342   2670.64      2644    0.0265384   0.0239066
>     9      16      6043      6027    2678.3      2740    0.0260798   0.0238637
>    10      16      6731      6715   2685.64      2752    0.0174624   0.0237982
> Total time run:         10.026091
> Total writes made:      6731
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     2685.39
> Stddev Bandwidth:       70.0286
> Max bandwidth (MB/sec): 2752
> Min bandwidth (MB/sec): 2512
> Average IOPS:           671
> Stddev IOPS:            17
> Max IOPS:               688
> Min IOPS:               628
> Average Latency(s):     0.023819
> Stddev Latency(s):      0.00463709
> Max latency(s):         0.0594516
> Min latency(s):         0.0138556
>
> [root@zeus-59 ceph-block-device]# rados bench -p scbench 10 seq
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      15      1150      1135   4498.75      4540    0.0146433   0.0131456
>     2      15      2313      2298   4571.38      4652    0.0144489   0.0131564
>     3      15      3468      3453   4585.68      4620   0.00975626   0.0131211
>     4      15      4663      4648   4633.41      4780    0.0163181   0.0130076
>     5      15      5949      5934   4734.49      5144   0.00944718   0.0127327
> Total time run:       5.643929
> Total reads made:     6731
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   4770.43
> Average IOPS:         1192
> Stddev IOPS:          59
> Max IOPS:             1286
> Min IOPS:             1135
> Average Latency(s):   0.0126349
> Max latency(s):       0.0490061
> Min latency(s):       0.00613382
>
> [root@zeus-59 ceph-block-device]# rados bench -p scbench 10 rand
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      15      1197      1182    4726.8      4728    0.0130331     0.012711
>     2      15      2364      2349   4697.02      4668    0.0105971    0.0128123
>     3      15      3686      3671   4893.78      5288   0.00906867    0.0123103
>     4      15      4994      4979   4978.16      5232   0.00946901     0.012104
>     5      15      6302      6287   5028.83      5232    0.0115159    0.0119879
>     6      15      7620      7605   5069.28      5272   0.00986636    0.0118935
>     7      15      8912      8897   5083.31      5168    0.0106201    0.0118648
>     8      15     10185     10170   5084.34      5092    0.0116891    0.0118632
>     9      15     11484     11469   5096.68      5196   0.00911787    0.0118354
>    10      16     12748     12732   5092.16      5052    0.0111988    0.0118476
> Total time run:       10.020135
> Total reads made:     12748
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   5088.95
> Average IOPS:         1272
> Stddev IOPS:          55
> Max IOPS:             1322
> Min IOPS:             1167
> Average Latency(s):   0.0118531
> Max latency(s):       0.0441046
> Min latency(s):       0.00590162
>
> [root@zeus-59 ceph-block-device]# rbd bench-write image01 --pool=rbdbench
> bench-write  io_size 4096  io_threads 16  bytes 1073741824  pattern sequential
>   SEC       OPS   OPS/SEC     BYTES/SEC
>     1     56159  56180.51  230115361.66
>     2    119975  59998.28  245752967.01
>     3    182956  60990.78  249818235.33
>     4    244195  61054.17  250077889.88
> elapsed:     4  ops:   262144  ops/sec: 60006.56  bytes/sec: 245786880.86
> [root@zeus-59 ceph-block-device]#
>
> I am far from a ceph/storage expert, but my feeling is that the numbers
> reported by rbd bench-write are quite poor considering the hardware I am
> using (please correct me if I am wrong).
>
> I would like to ask the community for some help digging into this issue
> and finding what is throttling performance (CPU? memory? network
> configuration? not enough data nodes? not enough OSDs per disk? CPU
> pinning? etc.).
>
> Apologies beforehand, as I know this is quite a broad topic that is not
> easy to give an exact answer to, but I would like some guidance, and I
> hope we can make this an interesting performance-troubleshooting topic
> for other people who are learning distributed storage and ceph.
>
> Thank you very much
>
> MANUEL SOPENA BALLESTEROS | Systems engineer
> GARVAN INSTITUTE OF MEDICAL RESEARCH
> The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
> T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: [email protected]
>
Hi Manuel
To get the complete picture, you should run a tool that measures resource
load while you are performing the rados benchmark, such as atop, sysstat,
or collectl. The most important metrics to watch are CPU % utilization,
disk % utilization (% busy), and interface throughput (or % util if using
sysstat). This will show you where the bottleneck is; note that the
bottleneck can be different for different I/O patterns (seq/rand, 4k-4M
block sizes).
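For example, with the sysstat package installed, something like this in a
second terminal while the benchmark runs (run it on the OSD nodes as well
as the client):

iostat -x 1    # per-device %util (busy) plus CPU utilization, 1s interval
sar -n DEV 1   # per-interface rx/tx throughput, 1s interval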
Some other points:
For filestore there are double writes due to the journal; with a
collocated journal you halve your bandwidth and reduce IOPS by nearly 3x
(roughly 3 seeks per write). If you add in your replica writes there is
another factor of 2. For your 2.5 GB/s rados write result, each disk will
be doing about 500 MB/s; for the 60K IOPS 4k write result, each disk will
do about 18K write IOPS. BlueStore will give you better performance since
it does not have the double writes.
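The rough arithmetic behind those per-disk figures (assuming your size=2
pools, 20 OSDs, and collocated filestore journals):

2.5 GB/s client writes x 2 (replica) x 2 (journal) = 10 GB/s on disk; / 20 disks = ~500 MB/s per disk
60K client IOPS x 2 (replica) x ~3 (seeks per write) = ~360K disk IOPS; / 20 disks = ~18K IOPS per disk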
For reads, the 4.5 GB/s you are seeing gives 225 MB/s per disk, which is
low, but I suspect this is due to the default thread count of 16, which
is lower than your disk count, so some disks will sit idle. You should
increase the thread count to 64 or 128, as in the sketch below.
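A minimal sketch (the -t option sets the number of concurrent operations;
pool and duration as in your tests):

rados bench -p scbench 10 seq -t 64
rados bench -p scbench 10 rand -t 64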
The 4.5 GB/s could also be a limit of your client machine: rados bench is
a client-side application, and in many cases the bottleneck is client
resources. It is best to run several clients and aggregate their results;
do not run the clients on the same nodes as the OSDs, or your resources
will be doubly loaded. Keep adding clients until the total cluster output
saturates. There is a tool that can automate this if you need that level
of detail:
https://github.com/ceph/cbt
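If you script it yourself instead, something like this on each client
host ("client1" is a placeholder; --run-name keeps the object prefixes
distinct so concurrent instances do not collide):

rados bench -p scbench 60 write -t 64 --run-name client1 --no-cleanup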
Lastly, I would recommend running fio to test raw disk performance; note
that the journal does sequential writes with O_DSYNC. (Careful: these
write to the raw device and will destroy any data on it, so only test
disks that are not in service.)
4M sequential write O_DSYNC, 1 worker:
fio --name=xx --filename=/dev/sdX --iodepth=1 --rw=write --bs=4M --direct=1 --sync=1 --runtime=20 --time_based --numjobs=1 --group_reporting

4M random write, 1 worker:
fio --name=xx --filename=/dev/sdX --iodepth=1 --rw=randwrite --bs=4M --direct=1 --runtime=20 --time_based --numjobs=1 --group_reporting

4M sequential write O_DSYNC, 8 workers:
fio --name=xx --filename=/dev/sdX --iodepth=1 --rw=write --bs=4M --direct=1 --sync=1 --runtime=20 --time_based --numjobs=8 --group_reporting

4M random write, 8 workers:
fio --name=xx --filename=/dev/sdX --iodepth=1 --rw=randwrite --bs=4M --direct=1 --runtime=20 --time_based --numjobs=8 --group_reporting
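Since your rbd bench-write used 4k I/Os, a 4k sync-write variant may also
be telling (same /dev/sdX placeholder; a suggested addition, not part of
the set above):

4k random write O_DSYNC, 1 worker:
fio --name=xx --filename=/dev/sdX --iodepth=1 --rw=randwrite --bs=4k --direct=1 --sync=1 --runtime=20 --time_based --numjobs=1 --group_reporting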
Hope this helps
Maged
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com