Thanks again Felix,
> > common {
> > protocol A;
> >
> > handlers {
> > pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh";
> > pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh";
> > echo o > /proc/sysrq-trigger ; halt -f";
>
> The above looks..."funny" to me. What's wrong here? Copy/Paste error?
>
> Did you modify any notify-* scripts?
Ah, I see what you mean; that was a cut-and-paste error I missed (apologies, a
stupid mistake). It should have read...
common {
protocol A;
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh";
local-io-error "/usr/lib/drbd/notify-io-error.sh";
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
}
From memory, I think I commented out the original line in the default global
config ...
#local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
and replaced it with the line as seen above ...
local-io-error "/usr/lib/drbd/notify-io-error.sh";
I didn't want a situation where an I/O error induced by extreme load would
trigger an emergency shutdown, as it's a virtualization server.
I did want to be notified, though.
The date stamps on the notify-* scripts are all uniform (predating the system
build), and I don't recall modifying them at all.
From the logs, I'm curious about the lines...
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.044910] block drbd9: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.044929] block drbd9: Marked additional 508 MB as out-of-sync based on AL.
...then a little further down
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.121108] block drbd9: role( Secondary -> Primary )
These errors were only seen on the rebooted node that was primary. Other than
this, the log entries for the two nodes were ostensibly the same.
This node was always primary, and the KVM virtual machine running off it does
not even exist on the other node; yet the resource roles (primary vs secondary)
have been reversed.
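To compare the two nodes' logs side by side, something like the following could pull out just the role changes and the AL-based out-of-sync marks for one device. It is only a rough sketch: the regular expressions are my own guess based on the message formats quoted above, the function name `scan` is made up for illustration, and it assumes each syslog entry sits on a single line (unlike the wrapped copies in this mail).

```python
import re

# Common prefix: syslog timestamp, host, "kernel:", then "block drbdN: ".
# Pattern is an assumption derived from the log lines quoted in this mail.
PREFIX = r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:.*?block (drbd\d+): "
ROLE_RE = re.compile(PREFIX + r"role\(\s*(\w+)\s*->\s*(\w+)\s*\)")
AL_RE = re.compile(
    PREFIX + r"Marked\s+additional\s+(\d+)\s+(\w+)\s+as out-of-sync based on AL"
)


def scan(lines, device="drbd9"):
    """Return (timestamp, kind, detail) tuples for the given DRBD device."""
    events = []
    for line in lines:
        m = ROLE_RE.search(line)
        if m and m.group(2) == device:
            events.append((m.group(1), "role", f"{m.group(3)} -> {m.group(4)}"))
            continue
        m = AL_RE.search(line)
        if m and m.group(2) == device:
            events.append(
                (m.group(1), "al-out-of-sync", f"{m.group(3)} {m.group(4)}")
            )
    return events


if __name__ == "__main__":
    sample = [
        "Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.044929] block drbd9: "
        "Marked additional 508 MB as out-of-sync based on AL.",
        "Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.121108] block drbd9: "
        "role( Secondary -> Primary )",
    ]
    for event in scan(sample):
        print(event)
```

Running the same filter over the kernel log of each node and diffing the results should make it easier to see where the two histories diverge, keeping in mind the clock skew between the nodes.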
The nodes of the resource were in a disconnected state prior to the reboot of
the primary node.
The secondary (disconnected) node remained on, and there is no HA setup
associated with either node on any resource.
I did note a clock skew of 3 minutes between the nodes, due to an incorrect ntp
source.
On both nodes, I also noticed ...
block drbd9: helper command: /sbin/drbdadm split-brain minor-9 exit code 127 (0x7f00)
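As a side note on that helper message: 0x7f00 looks like a raw wait()-style status word whose high byte is the exit code, and 127 is the status a shell conventionally returns when a command cannot be found. That might mean one of the handler scripts is missing or not executable, though I have not confirmed it. A minimal Python sketch of both checks follows; the function names are hypothetical, and the script paths are simply the ones from my handlers section above.

```python
import os


def helper_exit_code(raw_status: int) -> int:
    """Extract the exit code from a wait()-style status word such as 0x7f00."""
    return (raw_status >> 8) & 0xFF


# Handler scripts named in the common { handlers { ... } } section above.
HANDLER_SCRIPTS = [
    "/usr/lib/drbd/notify-pri-on-incon-degr.sh",
    "/usr/lib/drbd/notify-pri-lost-after-sb.sh",
    "/usr/lib/drbd/notify-io-error.sh",
    "/usr/lib/drbd/notify-split-brain.sh",
    "/usr/lib/drbd/notify-out-of-sync.sh",
]


def check_scripts(paths):
    """Return (path, problem) pairs for scripts that are missing or not executable."""
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append((path, "missing"))
        elif not os.access(path, os.X_OK):
            problems.append((path, "not executable"))
    return problems


if __name__ == "__main__":
    print(helper_exit_code(0x7F00))  # 127
    for path, problem in check_scripts(HANDLER_SCRIPTS):
        print(f"{path}: {problem}")
```

If this prints nothing after the 127, the scripts themselves are present and executable, and the next place to look would be whatever commands they call internally.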
Somehow the data (508 MB?) has rolled back, and while I'm sad that I've likely
lost it, I can't afford to release this system to production until I'm content
it won't happen again.
The userland tools are ...
drbdadm --version
DRBDADM_BUILDTAG=GIT-hash:\ ea9e28dbff98e331a62bcbcc63a6135808fe2917\ build\ by\ buildd@yellow\,\ 2010-06-01\ 11:06:12
DRBDADM_API_VERSION=88
DRBD_KERNEL_VERSION_CODE=0x080307
DRBDADM_VERSION_CODE=0x080307
DRBDADM_VERSION=8.3.7
Any assistance to help me dig a little deeper here will be greatly appreciated.
Cheers,
Lew