Your message dated Mon, 15 Jun 2020 08:59:03 +0200 with message-id <[email protected]> and subject line Re: Bug#962454: Link failures after upgrade to +deb10u1 has caused the Debian Bug report #962454, regarding Link failures after upgrade to +deb10u1 to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)

--
962454: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=962454
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Source: corosync
Version: 3.0.1-2+deb10u1
Severity: important

Hi,

Some weeks ago I upgraded corosync (3.0.1-2 -> 3.0.1-2+deb10u1) and
started to notice these messages in my nodes (two node cluster):

Jun 2 01:10:13 patty corosync[2346]: [KNET ] link: host: 2 link: 0 is down
Jun 2 01:10:13 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 2 01:10:14 patty corosync[2346]: [KNET ] rx: host: 2 link: 0 is up
Jun 2 01:10:14 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 3 03:11:07 patty corosync[2346]: [KNET ] link: host: 2 link: 1 is down
Jun 3 03:11:07 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 3 03:11:08 patty corosync[2346]: [KNET ] rx: host: 2 link: 1 is up
Jun 3 03:11:08 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)

Notice the failure happens on both links. One of the links is a
cross-over cable. The other uses a bond with two interfaces.

These errors are more common on one of the nodes than on the other.
Sometimes they match (both nodes log the link failure), but most of
the time only one node complains:

Jun 4 01:16:23 selma corosync[52890]: [KNET ] link: host: 1 link: 0 is down
Jun 4 01:16:23 selma corosync[52890]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 4 01:16:24 selma corosync[52890]: [KNET ] rx: host: 1 link: 0 is up
Jun 4 01:16:24 selma corosync[52890]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)

Jun 4 01:16:55 patty corosync[2346]: [KNET ] link: host: 2 link: 0 is down
Jun 4 01:16:55 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 4 01:16:56 patty corosync[2346]: [KNET ] rx: host: 2 link: 0 is up
Jun 4 01:16:56 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)

Here's my config:

totem {
    version: 2
    cluster_name: web
    crypto_cipher: none
    crypto_hash: none
    interface {
        linknumber: 0
    }
    interface {
        linknumber: 1
    }
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: yes
    debug: off
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}

nodelist {
    node {
        name: patty
        nodeid: 1
        ring0_addr: 192.168.144.1
        ring1_addr: 10.10.1.5
    }
    node {
        name: selma
        nodeid: 2
        ring0_addr: 192.168.144.2
        ring1_addr: 10.10.1.6
    }
}

Any help is appreciated.

Thanks,

Alberto

-- System Information:
Debian Release: bullseye/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.6.0-1-amd64 (SMP w/4 CPU cores)
Kernel taint flags: TAINT_FIRMWARE_WORKAROUND
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE= (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
--- End Message ---
--- Begin Message ---
On Thu, Jun 11, 2020 at 10:59:06AM +0200, Valentin Vidic wrote:
> On Mon, Jun 08, 2020 at 12:29:35PM +0200, Alberto Gonzalez Iniesta wrote:
> > Some weeks ago I upgraded corosync (3.0.1-2 -> 3.0.1-2+deb10u1) and
> > started to notice these messages in my nodes (two node cluster):
> > Jun 2 01:10:13 patty corosync[2346]: [KNET ] link: host: 2 link: 0 is down
> > Jun 2 01:10:13 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
> > Jun 2 01:10:14 patty corosync[2346]: [KNET ] rx: host: 2 link: 0 is up
> > Jun 2 01:10:14 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> > Jun 3 03:11:07 patty corosync[2346]: [KNET ] link: host: 2 link: 1 is down
> > Jun 3 03:11:07 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> > Jun 3 03:11:08 patty corosync[2346]: [KNET ] rx: host: 2 link: 1 is up
> > Jun 3 03:11:08 patty corosync[2346]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
>
> Hi, can you confirm that downgrading to the previous version solves the
> link problem for you?
>

Hi, so downgrading did not fix this, but worsened the situation:

Jun 15 07:10:34 selma corosync[21723]: [TOTEM ] Token has not been received in 750 ms
Jun 15 07:10:35 selma corosync[21723]: [TOTEM ] A processor failed, forming new configuration.
Jun 15 07:10:35 selma corosync[21723]: [TOTEM ] A new membership (1:72) was formed. Members
Jun 15 07:10:35 selma corosync[21723]: [CPG ] downlist left_list: 0 received
Jun 15 07:10:35 selma corosync[21723]: [CPG ] downlist left_list: 0 received
Jun 15 07:10:35 selma corosync[21723]: [QUORUM] Members[2]: 1 2
Jun 15 07:10:35 selma corosync[21723]: [MAIN ] Completed service synchronization, ready to provide service.
---------
Jun 15 07:10:34 patty corosync[15095]: [KNET ] link: host: 2 link: 0 is down
Jun 15 07:10:34 patty corosync[15095]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 15 07:10:34 patty corosync[15095]: [KNET ] host: host: 2 has no active links
Jun 15 07:10:34 patty corosync[15095]: [TOTEM ] Token has not been received in 36 ms
Jun 15 07:10:35 patty corosync[15095]: [KNET ] rx: host: 2 link: 0 is up
Jun 15 07:10:35 patty corosync[15095]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 15 07:10:35 patty corosync[15095]: [TOTEM ] A new membership (1:72) was formed. Members
Jun 15 07:10:35 patty corosync[15095]: [CPG ] downlist left_list: 0 received
Jun 15 07:10:35 patty corosync[15095]: [CPG ] downlist left_list: 0 received
Jun 15 07:10:35 patty corosync[15095]: [QUORUM] Members[2]: 1 2
Jun 15 07:10:35 patty corosync[15095]: [MAIN ] Completed service synchronization, ready to provide service.
-----------

So I guess I'll have to look somewhere else (but I'm quite sure this
started after the upgrade to +deb10u1)...

Anyway, thanks and sorry for the noise.

--
Alberto Gonzalez Iniesta | Formación, consultoría y soporte técnico
mailto/sip: [email protected] | en GNU/Linux y software libre
Encrypted mail preferred | http://inittab.com

Key fingerprint = 5347 CBD8 3E30 A9EB 4D7D 4BF2 009B 3375 6B9A AA55
--- End Message ---
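
The report notes that link failures are logged more often on one node than on the other. One way to check that from the logs is to tally the KNET "is down" lines per reporting node, peer host and link. The following is only a minimal sketch: the function name count_link_down and the regular expression are illustrative assumptions, written against the syslog line format quoted in this thread and not taken from corosync itself. The sample lines are copied from the first message.

```python
import re
from collections import Counter

# Hypothetical pattern, written against the syslog lines quoted in this
# thread; other syslog formats would need a different expression.
LINK_DOWN = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<node>\S+)\s+corosync\[\d+\]:\s+"
    r"\[KNET\s*\]\s+link:\s+host:\s+(?P<peer>\d+)\s+"
    r"link:\s+(?P<link>\d+)\s+is down"
)


def count_link_down(lines):
    """Count link-down events keyed by (reporting node, peer host id, link)."""
    counts = Counter()
    for line in lines:
        match = LINK_DOWN.match(line.strip())
        if match:
            key = (match["node"], int(match["peer"]), int(match["link"]))
            counts[key] += 1
    return counts


# Sample lines copied from the bug report.
SAMPLE = """\
Jun 2 01:10:13 patty corosync[2346]: [KNET ] link: host: 2 link: 0 is down
Jun 2 01:10:14 patty corosync[2346]: [KNET ] rx: host: 2 link: 0 is up
Jun 3 03:11:07 patty corosync[2346]: [KNET ] link: host: 2 link: 1 is down
Jun 4 01:16:23 selma corosync[52890]: [KNET ] link: host: 1 link: 0 is down
Jun 4 01:16:55 patty corosync[2346]: [KNET ] link: host: 2 link: 0 is down
"""

if __name__ == "__main__":
    for (node, peer, link), n in sorted(count_link_down(SAMPLE.splitlines()).items()):
        print(f"{node}: host {peer} link {link} went down {n} time(s)")
```

In practice the input would be the full syslog or /var/log/corosync/corosync.log from each node rather than the short sample; comparing the per-link counts across nodes shows whether one link or one node accounts for most of the flaps.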

