On Mon, Nov 20, 2017 at 2:36 PM, Christian Balzer <[email protected]> wrote:
> On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:
>
> > We're planning on installing 12x Virtual Machines with some heavy loads.
> >
> > The SSD drives are INTEL SSDSC2BA400G4.
>
> Interesting, where did you find those?
> Or did you have them lying around?
>
> I've been unable to get DC S3710 SSDs for nearly a year now.

In South Africa, one of our suppliers had some in stock. They're still
fairly new, about 2 months old now.

> > The SATA drives are ST8000NM0055-1RM112.
>
> Note that these (while fast) have an internal flash cache, limiting them
> to something like 0.2 DWPD.
> Probably not an issue with the WAL/DB on the Intels, but something to
> keep in mind.

I don't quite understand what you mean, please explain?

> > Please explain your comment, "b) will find a lot of people here who
> > don't approve of it."
>
> Read the archives.
> Converged clusters are complex, and debugging Ceph when tons of other
> things are going on at the same time on the machine is even more so.

OK, so I have 4 physical servers and need to set up a highly redundant
cluster. How else would you have done it? There is no budget for a SAN,
let alone a highly available SAN.

> > I don't have access to the switches right now, but they're new, so
> > whatever default config ships from the factory would be active. Though
> > iperf shows 10.5 GBytes / 9.02 Gbits/sec throughput.
>
> Didn't think it was the switches, but completeness' sake and all that.
>
> > What speeds would you expect?
> > "Though with your setup I would have expected something faster, but NOT
> > the theoretical 600MB/s 4 HDDs will do in sequential writes."
>
> What I wrote.
> A 7200RPM HDD, even these, cannot sustain writes much over 170MB/s, in
> the most optimal circumstances.
> So your cluster can NOT exceed about 600MB/s sustained writes with the
> effective bandwidth of 4 HDDs.
> Smaller writes/reads that can be cached by RAM, DB, onboard caches on the
> HDDs of course can and will be faster.
>
> But again, you're missing the point: even if you get 600MB/s writes out
> of your cluster, the number of 4k IOPS will be much more relevant to
> your VMs.

hdparm shows about 230MB/s:

root@virt2:~# hdparm -Tt /dev/sda

/dev/sda:
 Timing cached reads:   20250 MB in  2.00 seconds = 10134.81 MB/sec
 Timing buffered disk reads: 680 MB in  3.00 seconds = 226.50 MB/sec

600MB/s would be super nice, but in reality even 400MB/s would be nice.
Would it not be achievable?
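
To put a number on the 4k IOPS / latency side, I assume a small fio run
from inside one of the VMs is the kind of test you mean? Something roughly
like this (just a sketch on my side; the test file name, size and runtime
are arbitrary choices):

# 4k random writes, direct I/O, queue depth 32, 60 seconds; reports IOPS and latency
fio --name=4k-randwrite --filename=/root/fio-test.bin --size=1G \
    --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting

If that is the right kind of test, I'll run it from one of the VMs and
post the numbers.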
> > On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
> > down. Verify and if so fix this and re-test.": how?
>
> No idea, I don't do bluestore.
> You noticed the lack of a WAL/DB for sda, go and fix it.
> If in doubt, by destroying and re-creating.
>
> And if you're looking for a less invasive procedure, docs and the ML
> archive, but AFAIK there is nothing but re-creation at this time.

I used Proxmox to create the OSDs; it set up a DB device, but not a WAL
device.
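
If destroying and re-creating is the only option, I assume the rough
sequence for the affected disk is something like the below (a sketch only;
I still need to confirm that /dev/sda on virt2 really maps to osd.0, and
whether this should be driven through pveceph or plain ceph-disk):

# remove the incomplete OSD (assuming /dev/sda here is osd.0, which is already down)
ceph osd out 0
systemctl stop ceph-osd@0
ceph osd purge 0 --yes-i-really-mean-it

# wipe the disk, then re-create it with explicit DB and WAL partitions on the SSD
ceph-disk zap /dev/sda
ceph-disk prepare --bluestore /dev/sda --block.db /dev/sde --block.wal /dev/sde
ceph-disk activate /dev/sda1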
> Christian
>
> > On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer <[email protected]> wrote:
> >
> > > On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
> > >
> > > > Hi,
> > > >
> > > > Can someone please help me, how do I improve performance on our Ceph
> > > > cluster?
> > > >
> > > > The hardware in use is as follows:
> > > > 3x SuperMicro servers with the following configuration
> > > > 12-Core Dual XEON 2.2GHz
> > >
> > > Faster cores are better for Ceph, IMNSHO.
> > > Though with main storage on HDDs, this will do.
> > >
> > > > 128GB RAM
> > >
> > > Overkill for Ceph, but I see something else below...
> > >
> > > > 2x 400GB Intel DC SSD drives
> > >
> > > Exact model please.
> > >
> > > > 4x 8TB Seagate 7200rpm 6Gbps SATA HDDs
> > >
> > > One hopes that's a non-SMR one.
> > > Model please.
> > >
> > > > 1x SuperMicro DOM for Proxmox / Debian OS
> > >
> > > Ah, Proxmox.
> > > I'm personally not averse to converged, high density, multi-role
> > > clusters myself, but you:
> > > a) need to know what you're doing and
> > > b) will find a lot of people here who don't approve of it.
> > >
> > > I've avoided DOMs so far (non-hotswappable SPOF), even though the SM
> > > ones look good on paper with regards to endurance and IOPS.
> > > The latter being rather important for your monitors.
> > >
> > > > 4x Port 10GbE NIC
> > > > Cisco 10GbE switch.
> > >
> > > Configuration would be nice for those, LACP?
> > >
> > > > root@virt2:~# rados bench -p Data 10 write --no-cleanup
> > > > hints = 1
> > > > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > > > 4194304 for up to 10 seconds or 0 objects
> > >
> > > rados bench is a limited tool, and measuring bandwidth is pointless in
> > > nearly all use cases.
> > > Latency is where it is at, and testing from inside a VM is more
> > > relevant than synthetic tests of the storage.
> > > But it is a start.
> > >
> > > > Object prefix: benchmark_data_virt2_39099
> > > >   sec Cur ops  started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> > > >     0       0        0         0         0         0            -           0
> > > >     1      16       85        69   275.979       276     0.185576    0.204146
> > > >     2      16      171       155   309.966       344    0.0625409    0.193558
> > > >     3      16      243       227   302.633       288    0.0547129     0.19835
> > > >     4      16      330       314   313.965       348    0.0959492    0.199825
> > > >     5      16      413       397   317.565       332     0.124908    0.196191
> > > >     6      16      494       478   318.633       324       0.1556    0.197014
> > > >     7      15      591       576   329.109       392     0.136305    0.192192
> > > >     8      16      670       654   326.965       312    0.0703808    0.190643
> > > >     9      16      757       741   329.297       348     0.165211    0.192183
> > > >    10      16      828       812   324.764       284    0.0935803    0.194041
> > > > Total time run:         10.120215
> > > > Total writes made:      829
> > > > Write size:             4194304
> > > > Object size:            4194304
> > > > Bandwidth (MB/sec):     327.661
> > >
> > > What part of this surprises you?
> > >
> > > With a replication of 3, you have effectively the bandwidth of your 2
> > > SSDs (for small writes, not the case here) and the bandwidth of your 4
> > > HDDs available.
> > > Given overhead, other inefficiencies and the fact that this is not a
> > > sequential write from the HDD perspective, 320MB/s isn't all that bad.
> > > Though with your setup I would have expected something faster, but NOT
> > > the theoretical 600MB/s 4 HDDs will do in sequential writes.
> > >
> > > > Stddev Bandwidth:       35.8664
> > > > Max bandwidth (MB/sec): 392
> > > > Min bandwidth (MB/sec): 276
> > > > Average IOPS:           81
> > > > Stddev IOPS:            8
> > > > Max IOPS:               98
> > > > Min IOPS:               69
> > > > Average Latency(s):     0.195191
> > > > Stddev Latency(s):      0.0830062
> > > > Max latency(s):         0.481448
> > > > Min latency(s):         0.0414858
> > > >
> > > > root@virt2:~# hdparm -I /dev/sda
> > > >
> > > > root@virt2:~# ceph osd tree
> > > > ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
> > > > -1       72.78290 root default
> > > > -3       29.11316     host virt1
> > > >  1   hdd  7.27829         osd.1      up  1.00000 1.00000
> > > >  2   hdd  7.27829         osd.2      up  1.00000 1.00000
> > > >  3   hdd  7.27829         osd.3      up  1.00000 1.00000
> > > >  4   hdd  7.27829         osd.4      up  1.00000 1.00000
> > > > -5       21.83487     host virt2
> > > >  5   hdd  7.27829         osd.5      up  1.00000 1.00000
> > > >  6   hdd  7.27829         osd.6      up  1.00000 1.00000
> > > >  7   hdd  7.27829         osd.7      up  1.00000 1.00000
> > > > -7       21.83487     host virt3
> > > >  8   hdd  7.27829         osd.8      up  1.00000 1.00000
> > > >  9   hdd  7.27829         osd.9      up  1.00000 1.00000
> > > > 10   hdd  7.27829         osd.10     up  1.00000 1.00000
> > > >  0              0 osd.0            down        0 1.00000
> > > >
> > > > root@virt2:~# ceph -s
> > > >   cluster:
> > > >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
> > > >     health: HEALTH_OK
> > > >
> > > >   services:
> > > >     mon: 3 daemons, quorum virt1,virt2,virt3
> > > >     mgr: virt1(active)
> > > >     osd: 11 osds: 10 up, 10 in
> > > >
> > > >   data:
> > > >     pools:   1 pools, 512 pgs
> > > >     objects: 6084 objects, 24105 MB
> > > >     usage:   92822 MB used, 74438 GB / 74529 GB avail
> > > >     pgs:     512 active+clean
> > > >
> > > > root@virt2:~# ceph -w
> > > >   cluster:
> > > >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
> > > >     health: HEALTH_OK
> > > >
> > > >   services:
> > > >     mon: 3 daemons, quorum virt1,virt2,virt3
> > > >     mgr: virt1(active)
> > > >     osd: 11 osds: 10 up, 10 in
> > > >
> > > >   data:
> > > >     pools:   1 pools, 512 pgs
> > > >     objects: 6084 objects, 24105 MB
> > > >     usage:   92822 MB used, 74438 GB / 74529 GB avail
> > > >     pgs:     512 active+clean
> > > >
> > > > 2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0
> > > >
> > > > The SSD drives are used as journal drives:
> > >
> > > Bluestore has no journals; don't confuse it and the people you're
> > > asking for help.
> > >
> > > > root@virt3:~# ceph-disk list | grep /dev/sde | grep osd
> > > > /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
> > > > block.db /dev/sde1
> > > > root@virt3:~# ceph-disk list | grep /dev/sdf | grep osd
> > > > /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
> > > > block.db /dev/sdf1
> > > > /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
> > > > block.db /dev/sdf2
> > > >
> > > > I see now /dev/sda doesn't have a journal, though it should have.
> > > > Not sure why.
> > >
> > > If an OSD has no fast WAL/DB, it will drag the overall speed down.
> > > Verify, and if so, fix this and re-test.
> > >
> > > Christian
> > >
> > > > This is the command I used to create it:
> > > >
> > > > pveceph createosd /dev/sda -bluestore 1 -journal_dev /dev/sde
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > [email protected]          Rakuten Communications
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]          Rakuten Communications

--
Kind Regards
Rudi Ahlers
Website: http://www.rudiahlers.co.za
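
PS: to double-check which OSDs actually ended up with a separate DB/WAL, I
assume the OSD metadata should list the partition paths, something along
these lines (I haven't verified the exact key names on this version):

# show where an OSD keeps its block.db / block.wal, if any, e.g. for osd.1
ceph osd metadata 1 | grep -E 'bluefs_(db|wal)_partition_path'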
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
