Are all 3 NICs in the same bond together? I don't think bonding NICs of
various speeds is a great idea. How are you separating the ceph traffic
into the individual NICs?
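A couple of quick checks might help answer that. This is only a minimal
sketch, assuming the standard Linux bonding driver; the interface names
(bond0, eno1, etc.) are placeholders, so adjust them to whatever the
hosts actually use:

  # which slave NICs are in each bond, and the bonding mode in use
  cat /proc/net/bonding/bond0

  # negotiated link speed of each physical slave
  ethtool eno1 | grep -i speed

  # which bond carries the public (192.168.16.0/24) and
  # cluster (192.168.17.0/24) networks
  ip -br addr show

If a 1 GBit and a 10 GBit port ended up in the same bond, balance-alb
could steer mon traffic over the slow slave, which would fit the
symptoms below.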
On Fri, Jul 6, 2018 at 7:04 AM, John Spray <[email protected]> wrote:

> On Fri, Jul 6, 2018 at 11:10 AM Marcus Haarmann
> <[email protected]> wrote:
> >
> > Hi experts,
> >
> > we have set up a Proxmox cluster with a minimal environment for some
> > testing. We have put some VMs on the cluster and encountered mon
> > quorum problems while backups are executed (possibly polluting either
> > hard disk I/O or network I/O).
> >
> > Setup:
> > 4 machines with Proxmox 5.2-2 (Ceph 12.2.5 luminous)
> > 3 ceph mons
> > 8 OSDs (2 per machine, each 2 TB disk space, usage 25%), with bluestore
> > 3 bonded NICs (balance-alb) active (one 1 GBit for Proxmox machine
> > access, one 10 GBit for ceph public and one 10 GBit for ceph cluster)
> >
> > Ceph config as follows:
> >
> > [global]
> > auth client required = cephx
> > auth cluster required = cephx
> > auth service required = cephx
> > cluster network = 192.168.17.0/24
> > fsid = 5070e036-8f6c-4795-a34d-9035472a628d
> > keyring = /etc/pve/priv/$cluster.$name.keyring
> > mon allow pool delete = true
> > osd journal size = 5120
> > osd pool default min size = 2
> > osd pool default size = 3
> > public network = 192.168.16.0/24
> >
> > [osd]
> > keyring = /var/lib/ceph/osd/ceph-$id/keyring
> >
> > [mon.ariel2]
> > host = ariel2
> > mon addr = 192.168.16.32:6789
> >
> > [mon.ariel1]
> > host = ariel1
> > mon addr = 192.168.16.31:6789
> >
> > [mon.ariel4]
> > host = ariel4
> > mon addr = 192.168.16.34:6789
> >
> > [osd.0]
> > public addr = 192.168.16.32
> > cluster addr = 192.168.17.32
> >
> > [osd.1]
> > public addr = 192.168.16.34
> > cluster addr = 192.168.17.34
> >
> > [osd.2]
> > public addr = 192.168.16.31
> > cluster addr = 192.168.17.31
> >
> > [osd.3]
> > public addr = 192.168.16.31
> > cluster addr = 192.168.17.31
> >
> > [osd.4]
> > public addr = 192.168.16.32
> > cluster addr = 192.168.17.32
> >
> > [osd.5]
> > public addr = 192.168.16.34
> > cluster addr = 192.168.17.34
> >
> > [osd.6]
> > public addr = 192.168.16.33
> > cluster addr = 192.168.17.33
> >
> > [osd.7]
> > public addr = 192.168.16.33
> > cluster addr = 192.168.17.33
> >
> > Everything is running smoothly until a backup is taken:
> > (from machine 2)
> >
> > 2018-07-06 02:47:54.691483 mon.ariel4 mon.2 192.168.16.34:6789/0 30663 : cluster [INF] mon.ariel4 calling monitor election
> > 2018-07-06 02:47:54.754901 mon.ariel2 mon.1 192.168.16.32:6789/0 29602 : cluster [INF] mon.ariel2 calling monitor election
> > 2018-07-06 02:47:59.934534 mon.ariel2 mon.1 192.168.16.32:6789/0 29603 : cluster [INF] mon.ariel2 is new leader, mons ariel2,ariel4 in quorum (ranks 1,2)
> > 2018-07-06 02:48:00.056711 mon.ariel2 mon.1 192.168.16.32:6789/0 29608 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel2,ariel4 (MON_DOWN)
> > 2018-07-06 02:48:00.133880 mon.ariel2 mon.1 192.168.16.32:6789/0 29610 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel2,ariel4
> > 2018-07-06 02:48:09.480385 mon.ariel1 mon.0 192.168.16.31:6789/0 33856 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:48:09.635420 mon.ariel4 mon.2 192.168.16.34:6789/0 30666 : cluster [INF] mon.ariel4 calling monitor election
> > 2018-07-06 02:48:09.635729 mon.ariel2 mon.1 192.168.16.32:6789/0 29613 : cluster [INF] mon.ariel2 calling monitor election
> > 2018-07-06 02:48:09.723634 mon.ariel1 mon.0 192.168.16.31:6789/0 33857 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:48:10.059104 mon.ariel1 mon.0 192.168.16.31:6789/0 33858 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> > 2018-07-06 02:48:10.587894 mon.ariel1 mon.0 192.168.16.31:6789/0 33863 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> > 2018-07-06 02:48:10.587910 mon.ariel1 mon.0 192.168.16.31:6789/0 33864 : cluster [INF] Cluster is now healthy
> > 2018-07-06 02:48:22.038196 mon.ariel4 mon.2 192.168.16.34:6789/0 30668 : cluster [INF] mon.ariel4 calling monitor election
> > 2018-07-06 02:48:22.078876 mon.ariel2 mon.1 192.168.16.32:6789/0 29615 : cluster [INF] mon.ariel2 calling monitor election
> > 2018-07-06 02:48:27.197263 mon.ariel2 mon.1 192.168.16.32:6789/0 29616 : cluster [INF] mon.ariel2 is new leader, mons ariel2,ariel4 in quorum (ranks 1,2)
> > 2018-07-06 02:48:27.237330 mon.ariel2 mon.1 192.168.16.32:6789/0 29621 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel2,ariel4 (MON_DOWN)
> > 2018-07-06 02:48:27.357095 mon.ariel2 mon.1 192.168.16.32:6789/0 29622 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel2,ariel4
> > 2018-07-06 02:48:32.456742 mon.ariel1 mon.0 192.168.16.31:6789/0 33867 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:48:33.011025 mon.ariel1 mon.0 192.168.16.31:6789/0 33868 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> > 2018-07-06 02:48:33.967501 mon.ariel1 mon.0 192.168.16.31:6789/0 33873 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> > 2018-07-06 02:48:33.967523 mon.ariel1 mon.0 192.168.16.31:6789/0 33874 : cluster [INF] Cluster is now healthy
> > 2018-07-06 02:48:35.002941 mon.ariel1 mon.0 192.168.16.31:6789/0 33875 : cluster [INF] overall HEALTH_OK
> > 2018-07-06 02:49:11.927388 mon.ariel4 mon.2 192.168.16.34:6789/0 30675 : cluster [INF] mon.ariel4 calling monitor election
> > 2018-07-06 02:49:12.001371 mon.ariel2 mon.1 192.168.16.32:6789/0 29629 : cluster [INF] mon.ariel2 calling monitor election
> > 2018-07-06 02:49:17.163727 mon.ariel2 mon.1 192.168.16.32:6789/0 29630 : cluster [INF] mon.ariel2 is new leader, mons ariel2,ariel4 in quorum (ranks 1,2)
> > 2018-07-06 02:49:17.199214 mon.ariel2 mon.1 192.168.16.32:6789/0 29635 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel2,ariel4 (MON_DOWN)
> > 2018-07-06 02:49:17.296646 mon.ariel2 mon.1 192.168.16.32:6789/0 29636 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel2,ariel4
> > 2018-07-06 02:49:47.014202 mon.ariel1 mon.0 192.168.16.31:6789/0 33880 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:49:47.357144 mon.ariel1 mon.0 192.168.16.31:6789/0 33881 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> > 2018-07-06 02:49:47.639535 mon.ariel1 mon.0 192.168.16.31:6789/0 33886 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> > 2018-07-06 02:49:47.639553 mon.ariel1 mon.0 192.168.16.31:6789/0 33887 : cluster [INF] Cluster is now healthy
> > 2018-07-06 02:49:47.810993 mon.ariel1 mon.0 192.168.16.31:6789/0 33888 : cluster [INF] overall HEALTH_OK
> > 2018-07-06 02:49:59.349085 mon.ariel4 mon.2 192.168.16.34:6789/0 30681 : cluster [INF] mon.ariel4 calling monitor election
> > 2018-07-06 02:49:59.427457 mon.ariel2 mon.1 192.168.16.32:6789/0 29648 : cluster [INF] mon.ariel2 calling monitor election
> > 2018-07-06 02:50:02.978856 mon.ariel1 mon.0 192.168.16.31:6789/0 33889 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:50:03.299621 mon.ariel1 mon.0 192.168.16.31:6789/0 33890 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> >
> > From machine 1:
> >
> > 2018-07-06 02:47:08.541710 mon.ariel2 mon.1 192.168.16.32:6789/0 29590 : cluster [INF] mon.ariel2 calling monitor election
> > 2018-07-06 02:47:12.949379 mon.ariel1 mon.0 192.168.16.31:6789/0 33844 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:47:13.929753 mon.ariel1 mon.0 192.168.16.31:6789/0 33845 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> > 2018-07-06 02:47:16.793479 mon.ariel1 mon.0 192.168.16.31:6789/0 33850 : cluster [INF] overall HEALTH_OK
> > 2018-07-06 02:48:09.480385 mon.ariel1 mon.0 192.168.16.31:6789/0 33856 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:48:09.635420 mon.ariel4 mon.2 192.168.16.34:6789/0 30666 : cluster [INF] mon.ariel4 calling monitor election
> > 2018-07-06 02:48:09.635729 mon.ariel2 mon.1 192.168.16.32:6789/0 29613 : cluster [INF] mon.ariel2 calling monitor election
> > 2018-07-06 02:48:09.723634 mon.ariel1 mon.0 192.168.16.31:6789/0 33857 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:48:10.059104 mon.ariel1 mon.0 192.168.16.31:6789/0 33858 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> > 2018-07-06 02:48:10.587894 mon.ariel1 mon.0 192.168.16.31:6789/0 33863 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> > 2018-07-06 02:48:10.587910 mon.ariel1 mon.0 192.168.16.31:6789/0 33864 : cluster [INF] Cluster is now healthy
> > 2018-07-06 02:48:32.456742 mon.ariel1 mon.0 192.168.16.31:6789/0 33867 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:48:33.011025 mon.ariel1 mon.0 192.168.16.31:6789/0 33868 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> > 2018-07-06 02:48:33.967501 mon.ariel1 mon.0 192.168.16.31:6789/0 33873 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> > 2018-07-06 02:48:33.967523 mon.ariel1 mon.0 192.168.16.31:6789/0 33874 : cluster [INF] Cluster is now healthy
> > 2018-07-06 02:48:35.002941 mon.ariel1 mon.0 192.168.16.31:6789/0 33875 : cluster [INF] overall HEALTH_OK
> > 2018-07-06 02:49:47.014202 mon.ariel1 mon.0 192.168.16.31:6789/0 33880 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:49:47.357144 mon.ariel1 mon.0 192.168.16.31:6789/0 33881 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> > 2018-07-06 02:49:47.639535 mon.ariel1 mon.0 192.168.16.31:6789/0 33886 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> > 2018-07-06 02:49:47.639553 mon.ariel1 mon.0 192.168.16.31:6789/0 33887 : cluster [INF] Cluster is now healthy
> > 2018-07-06 02:49:47.810993 mon.ariel1 mon.0 192.168.16.31:6789/0 33888 : cluster [INF] overall HEALTH_OK
> > 2018-07-06 02:49:59.349085 mon.ariel4 mon.2 192.168.16.34:6789/0 30681 : cluster [INF] mon.ariel4 calling monitor election
> > 2018-07-06 02:49:59.427457 mon.ariel2 mon.1 192.168.16.32:6789/0 29648 : cluster [INF] mon.ariel2 calling monitor election
> > 2018-07-06 02:50:02.978856 mon.ariel1 mon.0 192.168.16.31:6789/0 33889 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:50:03.299621 mon.ariel1 mon.0 192.168.16.31:6789/0 33890 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> > 2018-07-06 02:50:03.642986 mon.ariel1 mon.0 192.168.16.31:6789/0 33895 : cluster [INF] overall HEALTH_OK
> > 2018-07-06 02:50:46.757619 mon.ariel1 mon.0 192.168.16.31:6789/0 33899 : cluster [INF] mon.ariel1 calling monitor election
> > 2018-07-06 02:50:46.920468 mon.ariel1 mon.0 192.168.16.31:6789/0 33900 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> > 2018-07-06 02:50:47.104222 mon.ariel1 mon.0 192.168.16.31:6789/0 33905 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> > 2018-07-06 02:50:47.104240 mon.ariel1 mon.0 192.168.16.31:6789/0 33906 : cluster [INF] Cluster is now healthy
> > 2018-07-06 02:50:47.256301 mon.ariel1 mon.0 192.168.16.31:6789/0 33907 : cluster [INF] overall HEALTH_OK
> >
> > There seems to be some disturbance of mon traffic.
> > Since the mons are communicating via a 10 GBit interface, I would not
> > assume a problem here.
> > There are no errors logged either on the network interfaces or on the
> > switches.
> >
> > Maybe the disks are too slow (OSDs are on SATA), so we are thinking
> > about putting the bluestore journal on an SSD.
> > But would that action help to stabilize the mons?
>
> Probably not -- if anything, giving the OSDs faster storage will
> alleviate that bottleneck and enable the OSD daemons to hit the CPU
> even harder.
>
> > Or would a setup with 5 machines (5 mons running) be the better choice?
>
> Adding more mons won't help if they're being disrupted by running on
> overloaded nodes.
>
> > So we are a little stuck where to search for a solution.
> > What debug output would help to see whether we have a disk or network
> > problem here?
>
> You didn't mention what storage devices the mons are using (i.e. what
> device is hosting /var) -- hopefully it's an SSD that isn't loaded
> with any other heavy workloads?
>
> For checking how saturated your CPU and network are, you'd use any of
> the standard Linux tools (I'm a dstat fan, personally).
>
> When running mons and OSDs on the same nodes, it is preferred to run
> them in containers so that resource utilization can be limited, and
> thereby protect the monitors from over-enthusiastic OSD daemons.
> Otherwise, you will always have similar issues if any system resource
> is saturated. The alternative is to simply over-provision hardware to
> the point that contention is not an issue (which is a bit expensive of
> course).
>
> John
>
> > Thanks for your input!
> >
> > Marcus Haarmann
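P.S. On John's dstat point, a minimal sketch of what could be run on
each mon host while a backup is in flight (the flag set and 5-second
interval are just one reasonable choice):

  # CPU, disk, network and memory side by side, plus per-disk utilization
  dstat -cdnm --disk-util 5

If the elections line up with a saturated column there, that names the
culprit. And if the nodes stay on bare metal rather than containers,
one rough way to keep the OSD daemons from starving the mon is a
systemd resource limit, assuming the stock ceph-osd@<id>.service units;
the 400% quota is only an assumed starting point, not a recommendation:

  systemctl set-property ceph-osd@0.service CPUQuota=400%
  systemctl set-property ceph-osd@1.service CPUQuota=400%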
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
