13.11.2013 06:10, Jefferson Ogata wrote:
> Here's a problem i don't understand, and i'd like a solution to if
> possible, or at least i'd like to understand why it's a problem, because
> i'm clearly not getting something.
>
> I have an iSCSI target cluster using CentOS 6.4 with stock
> pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
>
> Both DRBD and cluster comms use a dedicated crossover link.
>
> The target storage is battery-backed RAID.
>
> DRBD resources all use protocol C.
>
> stonith is configured and working.
>
> tgtd write cache is disabled using mode_page in additional_params. This
> is correctly reported using sdparm --get WCE on initiators.
>
> Here's the question: if i am writing from an iSCSI initiator, and i take
> down the crossover link between the nodes of my cluster, i end up with
> corrupt data on the target disk.
>
> I know this isn't the formal way to test pacemaker failover.
> Everything's fine if i fence a node or do a manual migration or
> shutdown. But i don't understand why taking the crossover down results
> in corrupted write operations.
>
> In greater detail, assuming the initiator sends a write request for some
> block, here's the normal sequence as i understand it:
>
> - tgtd receives it and queues it straight for the device backing the LUN
>   (write cache is disabled).
> - drbd receives it, commits it to disk, sends it to the other node, and
>   waits for an acknowledgement (protocol C).
> - the remote node receives it, commits it to disk, and sends an
>   acknowledgement.
> - the initial node receives the drbd acknowledgement, and acknowledges
>   the write to tgtd.
> - tgtd acknowledges the write to the initiator.
>
> Now, suppose an initiator is writing when i take the crossover link
> down, and pacemaker reacts to the loss in comms by fencing the node with
> the currently active target. It then brings up the target on the
> surviving, formerly inactive, node. This results in a drbd split brain,
> since some writes have been queued on the fenced node but never made it
> to the surviving node, and must be retransmitted by the initiator; once
> the surviving node becomes active it starts committing these writes to
> its copy of the mirror. I'm fine with a split brain; i can resolve it by
> discarding outstanding data on the fenced node.
>
> But in practice, the actual written data is lost, and i don't understand
> why. AFAICS, none of the outstanding writes should have been
> acknowledged by tgtd on the fenced node, so when the surviving node
> becomes active, the initiator should simply re-send all of them. But
> this isn't what happens; instead most of the outstanding writes are
> lost. No i/o error is reported on the initiator; stuff just vanishes.
>
> I'm writing directly to a block device for these tests, so the lost data
> isn't the result of filesystem corruption; it simply never gets written
> to the target disk on the survivor.
>
> What am i missing?
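Just to be sure I'm picturing the same setup: with protocol C over the dedicated
replication link, I'd expect each resource definition to look roughly like the
sketch below. The resource name, devices, hostnames and addresses are placeholders
I've made up, not taken from your mail.

    resource r0 {
        net {
            protocol C;    # write is acknowledged only after the peer has it on stable storage
        }
        device    /dev/drbd0;
        disk      /dev/sdb1;              # placeholder backing device (your battery-backed RAID)
        meta-disk internal;
        on node-a {
            address 192.168.100.1:7789;   # dedicated crossover link
        }
        on node-b {
            address 192.168.100.2:7789;
        }
    }
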
Do you have handlers (fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";) configured in drbd.conf?
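
If not, the fencing integration documented for DRBD 8.4 with Pacemaker looks
roughly like the sketch below (the resource name is a placeholder; the disk and
handlers stanzas are the relevant part). The fence-peer / after-resync-target
handlers are only invoked when a fencing policy is set in the disk section; with
resource-and-stonith, DRBD additionally suspends I/O on the Primary until the
peer has been dealt with, and crm-fence-peer.sh adds a constraint that keeps
Pacemaker from promoting a peer that may hold stale data until
crm-unfence-peer.sh removes it after resync.

    resource r0 {
        disk {
            # Required for the handlers below to fire; resource-only is the
            # weaker alternative that does not suspend I/O while fencing.
            fencing resource-and-stonith;
        }
        handlers {
            # Places a location constraint so the (possibly stale) peer
            # cannot be promoted...
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # ...and removes that constraint once resync has completed.
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }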