OK, so it seems an MTU of 9000 didn't improve anything.

On Mon, Nov 20, 2017 at 5:34 PM, Sébastien VIGNERON <[email protected]> wrote:
> Your performance hit may come from here. When OSD daemons try to send a
> big frame, an MTU misconfiguration blocks them and they must send it
> again at a smaller size.
> On some switches, you have to set both the global and the per-interface
> MTU sizes.
>
> Cordialement / Best regards,
>
> Sébastien VIGNERON
> CRIANN,
> Ingénieur / Engineer
> Technopôle du Madrillet
> 745, avenue de l'Université
> 76800 Saint-Etienne du Rouvray - France
> tél. +33 2 32 91 42 91
> fax. +33 2 32 91 42 92
> http://www.criann.fr
> mailto:[email protected]
> support: [email protected]
>
> On 20 Nov 2017, at 16:21, Rudi Ahlers <[email protected]> wrote:
>
> I am not sure why, but I cannot get Jumbo Frames to work properly:
>
> root@virt2:~# ping -M do -s 8972 -c 4 10.10.10.83
> PING 10.10.10.83 (10.10.10.83) 8972(9000) bytes of data.
> ping: local error: Message too long, mtu=1500
> ping: local error: Message too long, mtu=1500
> ping: local error: Message too long, mtu=1500
>
> Jumbo Frames are on, on the switch and on the NICs:
>
> ens2f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
>         inet 10.10.10.83 netmask 255.255.255.0 broadcast 10.10.10.255
>         inet6 fe80::ec4:7aff:feea:7b40 prefixlen 64 scopeid 0x20<link>
>         ether 0c:c4:7a:ea:7b:40 txqueuelen 1000 (Ethernet)
>         RX packets 166440655 bytes 229547410625 (213.7 GiB)
>         RX errors 0 dropped 223 overruns 0 frame 0
>         TX packets 142788790 bytes 188658602086 (175.7 GiB)
>         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
>
> root@virt2:~# ifconfig ens2f0
> ens2f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
>         inet 10.10.10.82 netmask 255.255.255.0 broadcast 10.10.10.255
>         inet6 fe80::ec4:7aff:feea:ff2c prefixlen 64 scopeid 0x20<link>
>         ether 0c:c4:7a:ea:ff:2c txqueuelen 1000 (Ethernet)
>         RX packets 466774 bytes 385578454 (367.7 MiB)
>         RX errors 4 dropped 223 overruns 0 frame 3
>         TX packets 594975 bytes 580053745 (553.1 MiB)
>         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
>
> On Mon, Nov 20, 2017 at 2:13 PM, Sébastien VIGNERON <[email protected]> wrote:
>
>> As a jumbo frame test, can you try the following?
>>
>> ping -M do -s 8972 -c 4 IP_of_other_node_within_cluster_network
>>
>> If you get « ping: sendto: Message too long », jumbo frames are not
>> activated.
>>
>> On 20 Nov 2017, at 13:02, Rudi Ahlers <[email protected]> wrote:
>>
>> We're planning on installing 12x virtual machines with some heavy loads.
>>
>> The SSD drives are INTEL SSDSC2BA400G4.
>>
>> The SATA drives are ST8000NM0055-1RM112.
>>
>> Please explain your comment, "b) will find a lot of people here who
>> don't approve of it."
>>
>> I don't have access to the switches right now, but they're new, so
>> whatever default config ships from the factory would be active. Though
>> iperf shows 10.5 GBytes transferred at 9.02 Gbits/sec throughput.
>>
>> What speeds would you expect?
>> "Though with your setup I would have expected something faster, but NOT
>> the theoretical 600MB/s 4 HDDs will do in sequential writes."
>>
>> On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
>> down. Verify and if so fix this and re-test.": how?
>>
>> On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer <[email protected]> wrote:
>>
>>> On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
>>>
>>> > Hi,
>>> >
>>> > Can someone please help me? How do I improve performance on our Ceph
>>> > cluster?
>>> >
>>> > The hardware in use is as follows:
>>> > 3x SuperMicro servers with the following configuration:
>>> > 12-core dual Xeon 2.2GHz
>>>
>>> Faster cores are better for Ceph, IMNSHO.
>>> Though with main storage on HDDs, this will do.
>>>
>>> > 128GB RAM
>>>
>>> Overkill for Ceph, but I see something else below...
>>>
>>> > 2x 400GB Intel DC SSD drives
>>>
>>> Exact model, please.
>>>
>>> > 4x 8TB Seagate 7200rpm 6Gbps SATA HDDs
>>>
>>> One hopes that's a non-SMR one.
>>> Model, please.
>>>
>>> > 1x SuperMicro DOM for Proxmox / Debian OS
>>>
>>> Ah, Proxmox.
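(A note on the iperf figure quoted further up: 9.02 Gbit/s is essentially 10GbE line rate, so the raw link is unlikely to be the bottleneck. A quick back-of-the-envelope conversion, using only the numbers from the thread:)

```shell
# Convert the quoted iperf result (9.02 Gbit/s) to MiB/s:
# 1 Gbit = 1e9 bits, 8 bits per byte, 1 MiB = 1048576 bytes.
awk 'BEGIN { printf "%.0f MiB/s\n", 9.02e9 / 8 / 1048576 }'
# -> 1075 MiB/s
```

So the network can carry roughly 1 GiB/s per link, well above the ~320 MB/s rados bench reports below.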
>>> I'm personally not averse to converged, high density, multi-role
>>> clusters myself, but you:
>>> a) need to know what you're doing, and
>>> b) will find a lot of people here who don't approve of it.
>>>
>>> I've avoided DOMs so far (non-hotswappable SPOF), even though the SM
>>> ones look good on paper with regards to endurance and IOPS.
>>> The latter being rather important for your monitors.
>>>
>>> > 4x Port 10Gbe NIC
>>> > Cisco 10Gbe switch.
>>>
>>> Configuration would be nice for those, LACP?
>>>
>>> > root@virt2:~# rados bench -p Data 10 write --no-cleanup
>>> > hints = 1
>>> > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
>>> > 4194304 for up to 10 seconds or 0 objects
>>>
>>> rados bench is a limited tool, and measuring bandwidth is in nearly all
>>> use cases pointless.
>>> Latency is where it is at, and testing from inside a VM is more
>>> relevant than synthetic tests of the storage.
>>> But it is a start.
>>>
>>> > Object prefix: benchmark_data_virt2_39099
>>> >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>>> >     0       0         0         0         0         0            -           0
>>> >     1      16        85        69   275.979       276     0.185576    0.204146
>>> >     2      16       171       155   309.966       344    0.0625409    0.193558
>>> >     3      16       243       227   302.633       288    0.0547129     0.19835
>>> >     4      16       330       314   313.965       348    0.0959492    0.199825
>>> >     5      16       413       397   317.565       332     0.124908    0.196191
>>> >     6      16       494       478   318.633       324       0.1556    0.197014
>>> >     7      15       591       576   329.109       392     0.136305    0.192192
>>> >     8      16       670       654   326.965       312    0.0703808    0.190643
>>> >     9      16       757       741   329.297       348     0.165211    0.192183
>>> >    10      16       828       812   324.764       284    0.0935803    0.194041
>>> > Total time run:         10.120215
>>> > Total writes made:      829
>>> > Write size:             4194304
>>> > Object size:            4194304
>>> > Bandwidth (MB/sec):     327.661
>>>
>>> What part of this surprises you?
>>> With a replication of 3, you have effectively the bandwidth of your 2
>>> SSDs (for small writes, not the case here) and the bandwidth of your 4
>>> HDDs available.
>>> Given overhead, other inefficiencies, and the fact that this is not a
>>> sequential write from the HDD perspective, 320MB/s isn't all that bad.
>>> Though with your setup I would have expected something faster, but NOT
>>> the theoretical 600MB/s 4 HDDs will do in sequential writes.
>>>
>>> > Stddev Bandwidth:       35.8664
>>> > Max bandwidth (MB/sec): 392
>>> > Min bandwidth (MB/sec): 276
>>> > Average IOPS:           81
>>> > Stddev IOPS:            8
>>> > Max IOPS:               98
>>> > Min IOPS:               69
>>> > Average Latency(s):     0.195191
>>> > Stddev Latency(s):      0.0830062
>>> > Max latency(s):         0.481448
>>> > Min latency(s):         0.0414858
>>> >
>>> > root@virt2:~# hdparm -I /dev/sda
>>> >
>>> > root@virt2:~# ceph osd tree
>>> > ID CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
>>> > -1       72.78290 root default
>>> > -3       29.11316     host virt1
>>> >  1   hdd  7.27829         osd.1    up  1.00000 1.00000
>>> >  2   hdd  7.27829         osd.2    up  1.00000 1.00000
>>> >  3   hdd  7.27829         osd.3    up  1.00000 1.00000
>>> >  4   hdd  7.27829         osd.4    up  1.00000 1.00000
>>> > -5       21.83487     host virt2
>>> >  5   hdd  7.27829         osd.5    up  1.00000 1.00000
>>> >  6   hdd  7.27829         osd.6    up  1.00000 1.00000
>>> >  7   hdd  7.27829         osd.7    up  1.00000 1.00000
>>> > -7       21.83487     host virt3
>>> >  8   hdd  7.27829         osd.8    up  1.00000 1.00000
>>> >  9   hdd  7.27829         osd.9    up  1.00000 1.00000
>>> > 10   hdd  7.27829         osd.10   up  1.00000 1.00000
>>> >  0              0 osd.0          down        0 1.00000
>>> >
>>> > root@virt2:~# ceph -s
>>> >   cluster:
>>> >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
>>> >     health: HEALTH_OK
>>> >
>>> >   services:
>>> >     mon: 3 daemons, quorum virt1,virt2,virt3
>>> >     mgr: virt1(active)
>>> >     osd: 11 osds: 10 up, 10 in
>>> >
>>> >   data:
>>> >     pools:   1 pools, 512 pgs
>>> >     objects: 6084 objects, 24105 MB
>>> >     usage:   92822 MB used, 74438 GB / 74529 GB avail
>>> >     pgs:     512 active+clean
>>> >
>>> > root@virt2:~# ceph -w
>>> >   cluster:
>>> >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
>>> >     health: HEALTH_OK
>>> >
>>> >   services:
>>> >     mon: 3 daemons, quorum virt1,virt2,virt3
>>> >     mgr: virt1(active)
>>> >     osd: 11 osds: 10 up, 10 in
>>> >
>>> >   data:
>>> >     pools:   1 pools, 512 pgs
>>> >     objects: 6084 objects, 24105 MB
>>> >     usage:   92822 MB used, 74438 GB / 74529 GB avail
>>> >     pgs:     512 active+clean
>>> >
>>> > 2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0
>>> >
>>> > The SSD drives are used as journal drives:
>>>
>>> Bluestore has no journals; don't confuse it and the people you're
>>> asking for help.
>>>
>>> > root@virt3:~# ceph-disk list | grep /dev/sde | grep osd
>>> > /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
>>> > block.db /dev/sde1
>>> > root@virt3:~# ceph-disk list | grep /dev/sdf | grep osd
>>> > /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
>>> > block.db /dev/sdf1
>>> > /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
>>> > block.db /dev/sdf2
>>> >
>>> > I see now /dev/sda doesn't have a journal, though it should have.
>>> > Not sure why.
>>>
>>> If an OSD has no fast WAL/DB, it will drag the overall speed down.
>>> Verify and if so fix this and re-test.
>>>
>>> Christian
>>>
>>> > This is the command I used to create it:
>>> >
>>> > pveceph createosd /dev/sda -bluestore 1 -journal_dev /dev/sde
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> [email protected]           Rakuten Communications

--
Kind Regards
Rudi Ahlers
Website: http://www.rudiahlers.co.za
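A footnote on the missing block.db issue above: one way to spot BlueStore OSDs whose data device has no separate DB partition is to scan `ceph-disk list` output for data partitions that list a block device but no block.db. A minimal sketch, using illustrative sample lines in the same format as those quoted in the thread (on a real node you would pipe `ceph-disk list` itself):

```shell
# Flag active OSD data partitions with no block.db entry.
# The sample text is illustrative; replace with real `ceph-disk list` output.
sample='/dev/sda1 ceph data, active, cluster ceph, osd.0, block /dev/sda2
/dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2, block.db /dev/sde1'

echo "$sample" | awk '/ceph data, active/ && !/block\.db/ {
    gsub(",", "", $7)               # field 7 holds the "osd.N," token
    print $7, "has no block.db"
}'
# -> osd.0 has no block.db
```

On a live host: `ceph-disk list | awk '…'` with the same filter, run on each node in turn.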
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
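A footnote on the jumbo-frame ping test used throughout this thread: the payload size 8972 is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header, so `-M do -s 8972` only goes through if the whole path really carries 9000-byte frames. The arithmetic, with the peer IP taken from the thread:

```shell
# Payload for a do-not-fragment ping at a given MTU:
# MTU - 20 (IPv4 header) - 8 (ICMP header).
mtu=9000
payload=$((mtu - 20 - 8))
echo "$payload"
# -> 8972
echo "ping -M do -s $payload -c 4 10.10.10.83"
```

If that ping still reports "Message too long, mtu=1500", something on the path (or the local route) is still at 1500; `ip route get 10.10.10.83` may show the MTU the kernel has cached for that destination.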

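And on the bandwidth expectation Christian discusses: with size=3 replication, every client MB written becomes three MB written across the OSDs, so the quoted 327.661 MB/s client rate implies roughly the following aggregate and per-OSD load. A rough sketch using only the thread's numbers, ignoring WAL/DB traffic and other overhead:

```shell
# Client bandwidth from rados bench, replication factor, and in-service OSDs.
client_bw=327.661   # MB/s
repl=3
osds=10
awk -v b="$client_bw" -v r="$repl" -v n="$osds" 'BEGIN {
    printf "aggregate: %.0f MB/s, per-OSD: %.0f MB/s\n", b * r, b * r / n
}'
# -> aggregate: 983 MB/s, per-OSD: 98 MB/s
```

Close to 100 MB/s of sustained, not-quite-sequential writes per 7200rpm HDD is near what such drives can do, which is consistent with Christian's "320MB/s isn't all that bad".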