Lars Ellenberg gave some interesting information about these messages, at least if you have data-integrity checking of your network traffic enabled:
On Sat, Feb 26, 2011 at 07:31:03PM +0100, Walter Haidinger wrote:
> Hi Lars, thanks for the reply.
>
> > So you no longer have any problems/ASSERTs regarding drbd_al_read_log?
>
> No, those are gone. I did a create-md on the secondary node and a full
> resync. Don't know if that was "the fix", though, but I suppose so.
>
> > Well, what does the other (Primary) side say?
> > I'd expect it to say
> > "Digest mismatch, buffer modified by upper layers during write: ..."
>
> Yes, it does (see the kernel logs below).
>
> > If it does not, your link corrupts data.
> > If it does, well, then that's what happens.
> > (note: this double check on the sending side
> > has only been introduced with 8.3.10)
>
> Now where do I go from here?
> Any way to tell who or what is responsible for the data corruption?

There is just "buffers modified during writeout". That is not necessarily the same as data corruption. Quoting the DRBD User's Guide:

  Notes on data integrity

  There are two independent methods in DRBD to ensure the integrity of the
  mirrored data: the online-verify mechanism, and the data-integrity-alg of
  the network section.

  Both mechanisms might deliver false positives if the user of DRBD modifies
  the data which gets written to disk while the transfer goes on. This may
  happen for swap, for certain append-while-global-sync workloads, or for
  truncate/rewrite workloads, and does not necessarily pose a problem for
  the integrity of the data. Usually, when the initiator of the data
  transfer does this, it already knows that that data block will not be part
  of an on-disk data structure, or will be resubmitted with correct data
  soon enough.

  ...

If you don't want to know about that, disable the check. If the replication link interruptions caused by the check are bad for your setup (particularly so in dual-primary setups), disable the check. If you want to use it anyway: great, do so, and live with it.
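For reference, the data-integrity-alg mentioned above is set in the net section of a DRBD resource. A minimal sketch (the resource name r0 and the digest algorithm choice are placeholders, not taken from this thread):

```
resource r0 {
  net {
    # checksum every replicated data packet on the wire; this is the
    # check that produces the "Digest mismatch ..." warnings discussed here
    data-integrity-alg sha1;
  }
}
```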
If you want DRBD to do "end-to-end" data checksums, even if the data buffers may be modified while in flight, and still want it to be efficient: sponsor feature development.

The problem:
http://lwn.net/Articles/429305/
http://thread.gmane.org/gmane.linux.kernel/1103571
http://thread.gmane.org/gmane.linux.scsi/59259
And many, many more older threads on various mailing lists, some of them misleading, some of them mixing this issue of in-flight modifications with actual (hardware-caused) data corruption.

Possible solutions:

- DRBD first copies every submitted data page to private pages, then calculates the checksum over those copies. As the checksum is now taken over *private* pages, a mismatch is always a sign of real data corruption. It is also a significant performance hit. Potentially, we could optimistically try to get away without copying, and only take the performance hit once we see a mismatch, in which case we would still need to copy the data and send it again -- if we still have it.

- The generic Linux write-out path is fixed to not allow modification of data during write-out.

- The generic Linux block-integrity framework is fixed in whatever way is deemed most useful, and DRBD switches to use that instead, or simply forwards integrity information that may already have been generated by some layer above DRBD.

The "generic write-out path" people seem to be on it this time. Not sure if it will help much with VMs on top of DRBD, as they will run older kernels or different operating systems that do things differently, potentially screwing things up.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user

Hope that helps.
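The first proposed solution, and why the current live-buffer check can fire without real corruption, can be sketched in a few lines. The Python snippet below is purely illustrative (made-up buffer contents and function names); it only mimics what happens when an upper layer rewrites a page during write-out:

```python
import hashlib

def checksum_live(buf):
    # Checksum the submitter's buffer directly: if an upper layer
    # rewrites the page while the write is in flight, the digest
    # computed at send time no longer matches the data on the wire.
    return hashlib.sha1(buf).hexdigest()

def checksum_private_copy(buf):
    # Copy to a "private page" first, then checksum the copy.
    # Upper layers cannot touch the copy, so a later mismatch
    # would always indicate real corruption.
    private = bytes(buf)
    return hashlib.sha1(private).hexdigest(), private

buf = bytearray(b"block data at submit time")
digest_at_send = checksum_live(buf)
private_digest, private = checksum_private_copy(buf)

# Upper layer modifies the buffer during "write-out":
buf[:5] = b"XXXXX"

# Live-buffer check: a false positive ("buffer modified during write")
print(checksum_live(buf) == digest_at_send)                  # False

# Private-copy check: still consistent, at the cost of the copy
print(hashlib.sha1(private).hexdigest() == private_digest)   # True
```

The copy makes the mismatch signal unambiguous, which is exactly the trade-off Lars describes: correctness of the diagnostic versus an extra memory copy on every write.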
Mit freundlichen Grüßen / Best Regards

Robert Köppl
Systemadministration
KNAPP Systemintegration GmbH
Waltenbachstraße 9
8700 Leoben, Austria
Phone: +43 3842 805-910
Fax: +43 3842 82930-500
[email protected]
www.KNAPP.com
Commercial register number: FN 138870x
Commercial register court: Leoben

Boris Virc <[email protected]>
Sent by: [email protected]
28.02.2011 13:27
Please reply to: General Linux-HA mailing list <[email protected]>
To: General Linux-HA mailing list <[email protected]>
Cc:
Subject: Re: [Linux-HA] DRBD BrokenPipe

I checked the bonding mode for the NICs, and it is mode 4. I'm not too familiar with the mode types, so can anyone tell me which bonding mode is most suitable? (I'm using a direct crossover cable for the interconnect.)

Regards,
Boris

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Boris Virc
Sent: Monday, February 21, 2011 12:46 PM
To: [email protected]
Subject: [Linux-HA] DRBD BrokenPipe

Hello,

I have installed SLES with kernel version 2.6.32.19-0.3 and DRBD 8.3.8.1 (using two nodes, primary/secondary).
I noticed that there are a lot of BrokenPipe errors in the log files:

Feb 11 12:59:40 sles1 crm-fence-peer.sh[64879]: invoked for r0
Feb 11 12:59:41 sles1 crm-fence-peer.sh[64879]: INFO peer is reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-ms_drbd'
Feb 11 12:59:41 sles1 kernel: [6022113.566198] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Feb 11 12:59:41 sles1 kernel: [6022113.566206] block drbd0: fence-peer helper returned 4 (peer was fenced)
Feb 11 12:59:41 sles1 kernel: [6022113.566228] block drbd0: pdsk( DUnknown -> Outdated )
Feb 11 12:59:41 sles1 kernel: [6022113.566400] block drbd0: conn( BrokenPipe -> Unconnected )
Feb 11 12:59:41 sles1 kernel: [6022113.566418] block drbd0: receiver terminated
Feb 11 12:59:41 sles1 kernel: [6022113.566422] block drbd0: Restarting receiver thread
Feb 11 12:59:41 sles1 kernel: [6022113.566426] block drbd0: receiver (re)started
Feb 11 12:59:41 sles1 kernel: [6022113.566441] block drbd0: conn( Unconnected -> WFConnection )
Feb 11 12:59:41 sles1 pengine: [30521]: notice: unpack_config: On loss of CCM Quorum: Ignore

The system works, but within two months there have already been two unpredictable errors (we had to restart the secondary server so that the primary started to work again). Is there anything we can do to avoid these errors?

Regards,
Boris

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
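For context, the crm-fence-peer.sh invocations in Boris's log come from DRBD's Pacemaker fencing integration. A typical resource configuration looks roughly like the sketch below (the script path varies by distribution, and the resource name is a placeholder):

```
resource r0 {
  disk {
    # on replication-link loss, call the fence-peer handler
    fencing resource-only;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

The handler places a location constraint in the Pacemaker CIB so the disconnected peer cannot be promoted; the log's "fence-peer helper returned 4 (peer was fenced)" reports that this succeeded, and after resync the unfence script removes the constraint again.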