Greetings,
We have two MySQL clusters set up with DRBD. Both consist of RHEL 5 systems
(2.6.18-164.15.1.el5 #1 SMP) running DRBD 8.3.7, and both use Dolphin DXH510
cluster interconnects for the DRBD traffic via SuperSockets.
Things had been running fine for months until yesterday, when on one of the
clusters we found DRBD seemingly stuck in a tight loop or race condition. The
two DRBD worker threads on the primary system were pegged at or near 100% CPU
utilization, only sporadic traffic was passing over the SuperSockets link, and
MySQL was grinding to a halt, seemingly unable to complete its I/Os with any
speed. At the same time, the primary system reported a normal resource state:
1:r0 Connected Primary/Secondary UpToDate/UpToDate C r---- lvm-pv: replicated_db_log_vg 68.33G 68.33G
2:r1 Connected Primary/Secondary UpToDate/UpToDate C r---- lvm-pv: replicated_db_data_vg 546.79G 530.00G
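As an aside, since the incident we keep an eye on the connection state rather than trusting the summary alone. A minimal sketch of the kind of check we run, in plain shell/awk; the sample line below is hard-coded for illustration, whereas on a live node you would read /proc/drbd instead:

```shell
# Extract the connection state (the cs: field) from a /proc/drbd status line.
# The sample line is a stand-in for live /proc/drbd output.
line=' 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----'
cs=$(printf '%s\n' "$line" | awk '{for (i=1;i<=NF;i++) if ($i ~ /^cs:/) print substr($i,4)}')
echo "$cs"   # prints "Connected" for this sample line
```

Anything other than Connected (e.g. StandAlone, WFConnection) would trip an alert in our monitoring.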
Going back many days before the incident, I found kernel messages reporting
transient link errors on the Dolphin DX ports at a frequency of one or two
every several hours. I have requested support from Dolphin regarding these
transient link errors, which have not reappeared since a reboot. They seem
related to the hang, since the start of the DRBD trouble coincided with two of
these error messages. DRBD logged nothing prior to or during the event until I
started trying to recover by disconnecting resources and so on. Eventually we
just had to reboot the secondary to get out of this state without risking a
loss of pending MySQL transactions.
Certainly we will have to work with Dolphin to figure out the "transient"
issue with the link, but I wanted to check with the DRBD user community about
the hang we experienced. Shouldn't DRBD have been able to recover from
transient link issues by going StandAlone? Is something in our configuration
preventing more robust behavior in the face of link problems? I looked at the
release notes for 8.3.8 and was unsure whether the corrected race conditions,
or anything else mentioned there, could relate to our problem. Any other
thoughts you may have on this issue would be appreciated.
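In case it is relevant: we set nothing in the net section, so all connection-timeout behavior is at defaults. One thing we are considering is tuning the options that govern when a stalled peer connection is dropped. A sketch of what that might look like; the values are examples only, not recommendations, and the option names and units are taken from our reading of the 8.3 drbd.conf man page:

```
net {
    timeout       60;   # network timeout in tenths of a second (60 = 6s, the default)
    ping-int      10;   # seconds between keep-alive pings
    ping-timeout   5;   # tenths of a second to wait for a ping ack
    ko-count       4;   # drop the connection after this many consecutive
                        # request timeouts (0 = disabled, the default)
}
```

If anyone knows whether a nonzero ko-count would have forced a disconnect in a situation like ours, I would be glad to hear it.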
Here is our config:
global {
    usage-count yes;
}

common {
    protocol C;
}

resource r0 {
    syncer {
        rate     900M;
        cpu-mask 3;
    }
    device /dev/drbd1;
    handlers {
        outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
    }
    disk {
        on-io-error detach;
        no-disk-barrier;
        no-disk-flushes;
        no-md-flushes;
        fencing resource-only;
    }
    on host1 {
        address   sci 192.168.106.1:7789;
        meta-disk internal;
        disk      /dev/db_log_vg/db_log_lv;
    }
    on host2 {
        address   sci 192.168.106.2:7789;
        meta-disk internal;
        disk      /dev/db_log_vg/db_log_lv;
    }
}

resource r1 {
    syncer {
        rate     900M;
        cpu-mask 3;
    }
    device /dev/drbd2;
    handlers {
        outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
    }
    disk {
        on-io-error detach;
        no-disk-barrier;
        no-disk-flushes;
        no-md-flushes;
        fencing resource-only;
    }
    on host1 {
        address   sci 192.168.106.1:7790;
        meta-disk internal;
        disk      /dev/db_data_vg/db_data_lv;
    }
    on host2 {
        address   sci 192.168.106.2:7790;
        meta-disk internal;
        disk      /dev/db_data_vg/db_data_lv;
    }
}
Thanks,
Sean Foley
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user