Hi Lars,

> Maybe the logs you posted do not match the incident described.

There are no other logs available for this resource, and no additional 
information in the system logs that I'm able to find.

> Or you attached to stale data, thinking a rollback had taken place,
> but actually it is just stale data and the more recent data is still
> on the other node.
>
> But the logs you posted do not show any sync taking place, even cleary
> show that DRBD refuses to do a sync because it detected data
> divergence.
> There cannot have been a rollback, because there has been no sync,
> again according to the logs you posted.

That's correct: no sync has taken place, and it is still unsynced.

> Go back to your logs, and find the logs that match the incident
> described.
> 
> What is the status of that pair of DRBD now?
> Is it actually "cs:Connected, UpToDate/UpToDate" ?
> 
> Find out when it became so, and how. Because, again, the logs you
> showed previously, state, that DRBD refused to connect.
> If it finnaly synced up and connected anyways, likely someone told it
> to
> "--discard-my-data" on one of the nodes (or "invalidate" or something
> to
> that regard).
> And if that has been the side with the data you lost,
> well, then that someone told DRBD to throw it away.

The resource nodes are still disconnected and no override has been used to 
force the situation.
The only commands issued have been drbdadm connect all, drbdadm connect x2, 
drbdadm primary x2 (on the only node that has ever been primary) and drbdadm 
attach.
I'm the only one with access to these machines; I can assure you a sync has 
not been forced at any time.
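To answer the status question concretely: on the live nodes I'm checking state read-only with cat /proc/drbd and drbdadm cstate/dstate/role x2, none of which forces anything. The sketch below runs the same check against a captured sample line for minor 9 (the sample contents are illustrative, not the live output):

```shell
# Read-only state checks used on the live nodes (nothing here forces a sync):
#   cat /proc/drbd
#   drbdadm cstate x2 ; drbdadm dstate x2 ; drbdadm role x2
# Below, the same extraction run against an assumed sample of the
# /proc/drbd line for minor 9.
cat > /tmp/proc_drbd.sample <<'EOF'
 9: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
EOF

# Print connection state, roles, and disk states for minor 9.
awk '$1=="9:"{print $2, $3, $4}' /tmp/proc_drbd.sample
```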

The only log record against this resource in all archived messages prior to 
the system restart is:
Jan 11 11:30:31 emlsurit-v4 kernel: [7745016.672246] block drbd9: disk( 
UpToDate -> Diskless )
I expect this is the point at which the drbdadm detach was issued, while the 
node was primary and active. 
From the command history I can't determine which node the detach was issued 
from.
Does it matter which node a drbdadm detach is issued from?

I've attached the logs from each node (covering the period since the system 
restart; both created using grep drbd /var/log/messages).
The resource in question is drbd9.
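In case it saves anyone time, the state transitions for drbd9 can be pulled out of those files with something like the below (demonstrated on an inline sample rather than the real attachments; the sample path is assumed):

```shell
# Sketch: isolate the drbd9 state-change lines (conn/role/disk/pdsk
# transitions) from the grep'd message files. Shown against an inline
# sample taken from the lines quoted in this mail.
cat > /tmp/drbd_nodeA.sample <<'EOF'
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.121108] block drbd9: role( Secondary -> Primary )
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.122546] block drbd9: Creating new current UUID
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806263] block drbd9: conn( StandAlone -> Unconnected )
EOF

# Keep only drbd9 lines that record a state transition.
grep 'block drbd9:' /tmp/drbd_nodeA.sample | grep -E 'conn\(|role\(|disk\(|pdsk\('
```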

On node A the log details the system start at Jan 23 15:07:16.
The resource was later set primary, before network connection between the 
nodes was restored...
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.121108] block drbd9: role( Secondary 
-> Primary ) 
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.122546] block drbd9: Creating new 
current UUID
A minute later we can see the KVM instance start up and libvirt access the 
resource...

Jan 23 15:55:06 emlsurit-v4 kernel: [ 2880.172752] type=1503 
audit(1295758506.227:17):  operation="open" pid=8340 parent=1787 
profile="/usr/lib/libvirt/virt-aa-helper" requested_mask="r::" 
denied_mask="r::" fsuid=0 ouid=0 name="/dev/drbd9"

Later that evening the VLAN connectivity is restored and I issue a drbdadm 
connect all:
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806263] block drbd9: conn( 
StandAlone -> Unconnected ) 
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806312] block drbd9: Starting 
receiver thread (from drbd9_worker [2126])
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806353] block drbd9: receiver 
(re)started
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806359] block drbd9: conn( 
Unconnected -> WFConnection )

The handshake proceeds and split-brain is detected...
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905967] block drbd9: self 
49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143432 
flags:0
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905971] block drbd9: peer 
6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:336381 
flags:0
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905975] block drbd9: 
uuid_compare()=100 by rule 90
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.906273] block drbd9: helper command: 
/sbin/drbdadm split-brain minor-9
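For clarity, the UUID sets above can be compared mechanically: the current UUIDs (the first colon-separated field) differ between self and peer, while the bitmap/history UUIDs agree, which is the divergence pattern DRBD reports as split brain. A throwaway sketch, run against the two lines quoted above:

```shell
# Extract and compare the current data-generation UUID (first field of
# the colon-separated list) from the "self"/"peer" handshake lines.
cat > /tmp/drbd9_handshake.log <<'EOF'
block drbd9: self 49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143432 flags:0
block drbd9: peer 6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:336381 flags:0
EOF

self_cur=$(awk '/ self /{split($4,u,":"); print u[1]}' /tmp/drbd9_handshake.log)
peer_cur=$(awk '/ peer /{split($4,u,":"); print u[1]}' /tmp/drbd9_handshake.log)
echo "self=$self_cur peer=$peer_cur"
# Differing current UUIDs with matching history UUIDs => both nodes
# generated writes independently, hence split brain.
[ "$self_cur" != "$peer_cur" ] && echo "current UUIDs diverge (split brain)"
```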

Perhaps what has confused the matter is my initial post associating the 
split-brain with the data loss.
The node was primary and active prior to any split-brain, and it seems to me 
that the rollback/loss of data had occurred before the split-brain.
The only possibility I can still conceive of is that Node A rolled back or 
discarded changes in its activity log following the restart.
As far as I can determine this occurred prior to the split-brain, while the 
resource nodes were still disconnected (prior to restoration of network 
connectivity).

Just to be thorough, I'll export the KVM instance XML and start it up to 
investigate the other node, but do not expect to find the data that's missing 
there.

Thanks for all the efforts so far. 

Cheers,

Lew



Attachment: drbd_nodeA.gz
Description: GNU Zip compressed data

Attachment: drbd_nodeB.gz
Description: GNU Zip compressed data

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
