On 01.02.18 at 13:34, Lars Ellenberg wrote:
> On Tue, Jan 23, 2018 at 07:14:13PM +0100, Andreas Pflug wrote:
>> On 15.01.18 at 16:37, Andreas Pflug wrote:
>>> On 09.01.18 at 16:24, Lars Ellenberg wrote:
>>>> On Tue, Jan 09, 2018 at 03:36:34PM +0100, Lars Ellenberg wrote:
>>>>> On Mon, Dec 25, 2017 at 03:19:42PM +0100, Andreas Pflug wrote:
>>>>>> Running two Debian 9.3 machines, directly connected via on-board
>>>>>> X540 10GBit, with 15 drbd devices.
>>>>>>
>>>>>> When running a 4.14.2 kernel (from sid) or a 4.13.13 kernel (from
>>>>>> stretch-backports), I see several "Wrong magic value 0x4c414245 in
>>>>>> protocol version 101" per day issued by the secondary, with subsequent
>>>>>> termination of the connection, reconnect and resync. The magic value
>>>>>> logged differs, quite often 0x00.
>>>>>>
>>>>>> Using the current 4.9.65 kernel (or older) from stretch didn't show
>>>>>> these aborts in the past, and after going back they're gone again. It
>>>>>> seems to be some problem introduced after 4.9 kernels, since both 4.9
>>>>>> and 4.13 include drbd 8.4.7. Maybe some interference with the nic driver?
>>>>>>
>>>>>> Kernel    drbd     ixgbe     errors
>>>>>> 4.9.65    8.4.7    4.4.0-k   no
>>>>>> 4.13.13   8.4.7    5.1.0-k   yes
>>>>>> 4.14.2    8.4.10   5.1.0-k   yes
>>>>> "strange".
>>>>>
>>>>> What do "lsblk -D" and "lsblk -t" say?
>>>>>
>>>>> Do you have a scratch volume you can play with?
>>>>> As a datapoint, can you try to "blkdiscard /dev/drbdX" it?
>>>>> dd if=/dev/zero of=/dev/drbdX bs=1G oflag=direct count=1?
>>>>>
>>>>> Something like that?
>>>>> Any "easy" reproducer?
>>>> Maybe while preparing the pull requests for upstream,
>>>> we missed/mangled/broke something.
>>>>
>>>> Can you also reproduce with "out-of-tree" drbd 8.4.10?
>>>>
>>> So I currently have kernel 4.9.65 with drbd 8.3.7 on the primary server,
>>> with the second server (4.14.7 with drbd 8.3.11-rc1) having all drbd
>>> devices secondary.
>>>
>>> Logged in kern.log on the secondary:
>>> Jan 15 15:13:22 xen2 kernel: [451977.741177] drbd monitor.opt: Wrong
>>> magic value 0x64656772 in protocol version 101
>> Any news on this issue, anything to test?
>> Still getting that message 20 times a day, system not really busy.
> Nothing I can make any sense of, yet.
> And as of now, afaics, you are "the only one" reporting this.
> Can be a lot of things.
>
> Maybe you can set up a tcpdump capture in ringbuffer mode,
> wait for this to happen (watching the kernel log),
> and make the pcap containing the event available to me somehow?
>
> Something like this (please double check the man page yourself):
> tcpdump -s 0 -i $NIC -w drbd.pcap. -W 100 -C 1 [possible port filter here]
> (keep in mind that the pcap will contain raw block device data,
> which you may not want to show to "the internet").
>
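A side observation on the two magic values quoted in this thread: read as four bytes in network (big-endian) order, both decode to printable ASCII. That byte order is my assumption about how the value is logged, and the helper name `magic_to_ascii` is made up for illustration. It may hint that payload bytes showed up where a packet header was expected, but this is only a rough sketch to make the values readable, not a diagnosis:

```python
def magic_to_ascii(magic: int) -> str:
    """Render a 32-bit value as four big-endian bytes.

    Non-printable bytes are shown as '.'.
    Big-endian order is an assumption about how the value is logged.
    """
    raw = magic.to_bytes(4, "big")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# The two values that appear in the log messages quoted above.
for value in (0x4C414245, 0x64656772):
    print(f"0x{value:08x} -> {magic_to_ascii(value)!r}")
```

The frequently logged 0x00 contains no printable bytes, so it tells us nothing either way.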
After the tcpdump analysis showed that the problem must be located below
DRBD, I played around with the eth settings. Reducing the MTU from 9710
to the default 1500 fixed the problem, and so did disabling
scatter-gather. So apparently a big MTU and scatter-gather don't play
nicely together on later kernels (or with the updated nic driver).

I posted a kernel bug on this:
https://bugzilla.kernel.org/show_bug.cgi?id=198723

Thanks for your help!

Regards
Andreas
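P.S. For anyone hitting the same symptom, the two workarounds could be applied roughly as sketched below. This assumes the usual `ip` (iproute2) and `ethtool` tools are installed. The interface name `eth0` and the helper names are placeholders, not taken from my setup. According to my tests either change on its own was enough. The sketch only prints the commands unless `dry_run=False` is passed, and actually running them needs root:

```python
import shlex
import subprocess
from typing import List


def mtu_command(nic: str, mtu: int = 1500) -> List[str]:
    """Command to set the MTU of `nic` (default 1500)."""
    return ["ip", "link", "set", "dev", nic, "mtu", str(mtu)]


def sg_off_command(nic: str) -> List[str]:
    """Command to disable scatter-gather offload on `nic`."""
    return ["ethtool", "-K", nic, "sg", "off"]


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# "eth0" is a placeholder; substitute the 10GBit interface used for DRBD.
run(mtu_command("eth0"))      # workaround 1: default MTU
run(sg_off_command("eth0"))   # workaround 2: scatter-gather off
```

Neither setting survives a reboot when applied this way, so it would have to go into the distribution's network configuration to persist.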
