Here's a problem I don't understand and would like a solution to if
possible; failing that, I'd at least like to understand why it happens,
because I'm clearly missing something.
I have an iSCSI target cluster running CentOS 6.4 with the stock
Pacemaker/CMAN/corosync stack and tgt, and DRBD 8.4 built from source.
Both DRBD and cluster comms use a dedicated crossover link.
The target storage is battery-backed RAID.
DRBD resources all use protocol C.
STONITH is configured and working.
The tgtd write cache is disabled using mode_page in additional_params,
and initiators correctly report this via sdparm --get WCE.
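For reference, a sketch of the relevant pieces (crm shell syntax; the
IQN, LUN, and device path are placeholders; additional_parameters is the
parameter name from the iSCSILogicalUnit resource agent, and the
mode_page byte string is the commonly published recipe for a caching
mode page with WCE cleared -- verify all of these against your tgt and
resource-agents versions):

```
primitive p_lun1 ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.2013-04.com.example:tgt1" lun=1 \
        path="/dev/drbd0" \
        additional_parameters="mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0"
```

With that in place, sdparm --get WCE on the initiator reports the WCE
bit as 0.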
Here's the question: if I am writing from an iSCSI initiator and I take
down the crossover link between the nodes of my cluster, I end up with
corrupt data on the target disk.
I know this isn't the formal way to test Pacemaker failover; everything
is fine if I fence a node or do a manual migration or shutdown. But I
don't understand why taking the crossover down results in corrupted
write operations.
In greater detail, assuming the initiator sends a write request for some
block, here's the normal sequence as I understand it:
- tgtd receives it and queues it straight to the device backing the LUN
(write cache is disabled).
- DRBD receives it, commits it to disk, sends it to the other node, and
waits for an acknowledgement (protocol C).
- The remote node receives it, commits it to disk, and sends an
acknowledgement.
- The initial node receives the DRBD acknowledgement and acknowledges
the write to tgtd.
- tgtd acknowledges the write to the initiator.
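The sequence above can be sketched in a few lines of Python (Node and
protocol_c_write are illustrative names, not actual tgt or DRBD APIs;
the point is that the initiator is acked only after both replicas have
committed):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.disk = {}  # block number -> data; stands in for the backing disk

def protocol_c_write(local, remote, block, data, link_up=True):
    """Acknowledge the write only after BOTH replicas have committed it."""
    local.disk[block] = data        # local DRBD commits to its backing disk
    if not link_up:
        # Replication link is down: no peer ack arrives, so tgtd never
        # acks the initiator, and the initiator must eventually retry.
        return False
    remote.disk[block] = data       # peer commits and sends its ack
    return True                     # tgtd acknowledges the initiator

primary = Node("alpha")
secondary = Node("bravo")

acked = protocol_c_write(primary, secondary, 0, b"payload")
print(acked, secondary.disk.get(0))   # True b'payload' -- both copies match

acked = protocol_c_write(primary, secondary, 1, b"more", link_up=False)
print(acked, secondary.disk.get(1))   # False None -- committed locally, never acked
```

The second call is exactly the state an in-flight write should be in
when the crossover goes down: on the old primary's disk, but unacked.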
Now, suppose an initiator is writing when I take the crossover link
down, and Pacemaker reacts to the loss of comms by fencing the node with
the currently active target. It then brings up the target on the
surviving, formerly inactive, node. This results in a DRBD split brain:
some writes were queued on the fenced node but never made it to the
surviving node, and must be retransmitted by the initiator; once the
surviving node becomes active, it starts committing those retransmitted
writes to its copy of the mirror. I'm fine with a split brain; I can
resolve it by discarding the outstanding data on the fenced node.
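That discard step follows the standard DRBD 8.4 manual split-brain
recovery (r0 is a placeholder resource name; the first three commands
run on the node whose data is to be thrown away):

```
drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0
# and on the survivor, if it has dropped to StandAlone:
drbdadm connect r0
```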
But in practice the actual written data is lost, and I don't understand
why. AFAICS, none of the outstanding writes should have been
acknowledged by tgtd on the fenced node, so when the surviving node
becomes active, the initiator should simply re-send all of them. That
isn't what happens, though: most of the outstanding writes are lost. No
I/O error is reported on the initiator; the data just vanishes.
I'm writing directly to a block device for these tests, so the lost data
isn't the result of filesystem corruption; it simply never gets written
to the target disk on the survivor.
What am I missing?
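The kind of direct block-write test I mean can be sketched like this
(a plain file stands in here for the block device, e.g. /dev/sdc, so
the sketch is self-contained; block size and helper names are mine):

```python
import os
import struct
import tempfile

BLOCK = 4096

def write_blocks(path, count):
    """Stamp each block with its own index so missing writes are detectable."""
    with open(path, "r+b") as f:
        for i in range(count):
            f.seek(i * BLOCK)
            f.write(struct.pack("<Q", i) * (BLOCK // 8))
            f.flush()
            os.fsync(f.fileno())    # force each write through the page cache

def verify_blocks(path, count):
    """Return the list of block numbers whose stamp did not survive."""
    lost = []
    with open(path, "rb") as f:
        for i in range(count):
            f.seek(i * BLOCK)
            if f.read(BLOCK) != struct.pack("<Q", i) * (BLOCK // 8):
                lost.append(i)
    return lost

fd, path = tempfile.mkstemp()
os.ftruncate(fd, 8 * BLOCK)
os.close(fd)
write_blocks(path, 8)
print(verify_blocks(path, 8))   # [] here; on the failed-over target it is not empty
os.unlink(path)
```

Against the real target, the verify pass after failover is what shows
blocks that were "written" (and never errored) simply absent.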
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems