13.11.2013 06:10, Jefferson Ogata wrote:
> Here's a problem i don't understand, and i'd like a solution to if
> possible, or at least i'd like to understand why it's a problem, because
> i'm clearly not getting something.
>
> I have an iSCSI target cluster using CentOS 6.4 with stock
> pacemaker/CMAN/corosync and tgt, and DRBD 8.4 which i've built from source.
>
> Both DRBD and cluster comms use a dedicated crossover link.
>
> The target storage is battery-backed RAID.
>
> DRBD resources all use protocol C.
>
> stonith is configured and working.
>
> tgtd write cache is disabled using mode_page in additional_params. This
> is correctly reported using sdparm --get WCE on initiators.
>
> Here's the question: if i am writing from an iSCSI initiator, and i take
> down the crossover link between the nodes of my cluster, i end up with
> corrupt data on the target disk.
>
> I know this isn't the formal way to test pacemaker failover.
> Everything's fine if i fence a node or do a manual migration or
> shutdown. But i don't understand why taking the crossover down results
> in corrupted write operations.
>
> In greater detail, assuming the initiator sends a write request for some
> block, here's the normal sequence as i understand it:
>
> - tgtd receives it and queues it straight for the device backing the LUN
>   (write cache is disabled).
> - drbd receives it, commits it to disk, sends it to the other node, and
>   waits for an acknowledgement (protocol C).
> - the remote node receives it, commits it to disk, and sends an
>   acknowledgement.
> - the initial node receives the drbd acknowledgement, and acknowledges
>   the write to tgtd.
> - tgtd acknowledges the write to the initiator.
>
> Now, suppose an initiator is writing when i take the crossover link
> down, and pacemaker reacts to the loss in comms by fencing the node with
> the currently active target. It then brings up the target on the
> surviving, formerly inactive, node. This results in a drbd split brain,
> since some writes have been queued on the fenced node but never made it
> to the surviving node, and must be retransmitted by the initiator; once
> the surviving node becomes active it starts committing these writes to
> its copy of the mirror. I'm fine with a split brain; i can resolve it by
> discarding outstanding data on the fenced node.
>
> But in practice, the actual written data is lost, and i don't understand
> why. AFAICS, none of the outstanding writes should have been
> acknowledged by tgtd on the fenced node, so when the surviving node
> becomes active, the initiator should simply re-send all of them. But
> this isn't what happens; instead most of the outstanding writes are
> lost. No i/o error is reported on the initiator; stuff just vanishes.
>
> I'm writing directly to a block device for these tests, so the lost data
> isn't the result of filesystem corruption; it simply never gets written
> to the target disk on the survivor.
>
> What am i missing?
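Just to be sure I'm picturing the same setup: with protocol C over the dedicated
replication link, I'd expect each resource definition to look roughly like the
sketch below. The resource name, devices, hostnames and addresses are placeholders
I've made up, not taken from your mail.

    resource r0 {
        net {
            protocol C;    # write is acknowledged only after the peer has it on stable storage
        }
        device    /dev/drbd0;
        disk      /dev/sdb1;              # placeholder backing device (your battery-backed RAID)
        meta-disk internal;
        on node-a {
            address 192.168.100.1:7789;   # dedicated crossover link
        }
        on node-b {
            address 192.168.100.2:7789;
        }
    }
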
Do you have handlers (fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";) configured in drbd.conf?
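
If not, the fencing integration documented for DRBD 8.4 with Pacemaker looks
roughly like the sketch below (the resource name is a placeholder; the disk and
handlers stanzas are the relevant part). The fence-peer / after-resync-target
handlers are only invoked when a fencing policy is set in the disk section; with
resource-and-stonith, DRBD additionally suspends I/O on the Primary until the
peer has been dealt with, and crm-fence-peer.sh adds a constraint that keeps
Pacemaker from promoting a peer that may hold stale data until
crm-unfence-peer.sh removes it after resync.

    resource r0 {
        disk {
            # Required for the handlers below to fire; resource-only is the
            # weaker alternative that does not suspend I/O while fencing.
            fencing resource-and-stonith;
        }
        handlers {
            # Places a location constraint so the (possibly stale) peer
            # cannot be promoted...
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # ...and removes that constraint once resync has completed.
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }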