Greetings,

We have 2 MySQL clusters set up with DRBD. Both consist of RHEL 5 systems
(2.6.18-164.15.1.el5 #1 SMP) running DRBD 8.3.7, and both use Dolphin DXH510
cluster interconnects for the DRBD traffic via SuperSockets. Things had been
running fine for months until yesterday, when on one of the clusters we found
DRBD seemingly stuck in a tight loop or race condition. The two drbd worker
threads on the Primary were pegged at or near 100% CPU utilization, only
sporadic traffic was passing over the SuperSockets link, and MySQL was
grinding to a halt, seemingly unable to get its I/Os serviced with any speed.
At the same time, the Primary reported a normal resource state:

1:r0  Connected Primary/Secondary UpToDate/UpToDate C r---- lvm-pv: replicated_db_log_vg   68.33G  68.33G
2:r1  Connected Primary/Secondary UpToDate/UpToDate C r---- lvm-pv: replicated_db_data_vg 546.79G 530.00G
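(For anyone reproducing our checks: the same state is also visible in /proc/drbd. A minimal sketch of pulling the cs:/ro:/ds: fields out of a captured status line — the sample line below is illustrative of the 8.3 format, not taken from our systems; on a live node you would read /proc/drbd itself:)

```shell
# Sample /proc/drbd resource line (DRBD 8.3 format), hard-coded here
# for illustration; normally: status=$(grep ' cs:' /proc/drbd)
status=' 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----'

cs=$(echo "$status" | sed -n 's/.*cs:\([^ ]*\).*/\1/p')   # connection state
ro=$(echo "$status" | sed -n 's/.*ro:\([^ ]*\).*/\1/p')   # roles (local/peer)
ds=$(echo "$status" | sed -n 's/.*ds:\([^ ]*\).*/\1/p')   # disk states

echo "connection=$cs roles=$ro disks=$ds"
```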

I found kernel messages relating to transient link errors on the Dolphin DX
ports, at a frequency of one or two every several hours, going back many days
before this. I have requested support from Dolphin regarding these transient
link errors, which have not reappeared since a reboot. They do seem related
to the hang, since the start of the DRBD trouble coincided with two of these
error messages. DRBD itself logged nothing before or during the event until I
started trying to recover by disconnecting resources and so on. Eventually we
had to reboot the Secondary to get out of this state without risking the loss
of pending MySQL transactions.

Certainly we will have to work with Dolphin to get to the bottom of the
"transient" link issue, but I wanted to check with the DRBD user community
about the hang we experienced. Shouldn't DRBD have been able to recover from
transient link issues by dropping into StandAlone mode? Is something in our
configuration preventing more robust behavior in the face of link problems?
I looked at the release notes for 8.3.8 and was unsure whether the corrected
race conditions (or any of the other fixes mentioned) could relate to our
problem. Any other thoughts you may have on this issue would be appreciated.

Here is our config:

global {
  usage-count yes;
}
common {
  protocol C;
}
resource r0 {
  syncer {
    rate 900M;
    cpu-mask 3;
  }

  device    /dev/drbd1;
  handlers {
      outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
  }
  disk {
    on-io-error detach;
    no-disk-barrier;
    no-disk-flushes;
    no-md-flushes;
    fencing resource-only;
  }
  on host1 {
    address sci 192.168.106.1:7789;
    meta-disk internal;
    disk      /dev/db_log_vg/db_log_lv;
  }
  on host2 {
    address sci 192.168.106.2:7789;
    meta-disk internal;
    disk      /dev/db_log_vg/db_log_lv;
  }
}
resource r1 {
  syncer {
    rate 900M;
    cpu-mask 3;
  }
  device    /dev/drbd2;
  handlers {
      outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
  }
  disk {
    on-io-error detach;
    no-disk-barrier;
    no-disk-flushes;
    no-md-flushes;
    fencing resource-only;
  }
  on host1 {
    address sci 192.168.106.1:7790;
    meta-disk internal;
    disk      /dev/db_data_vg/db_data_lv;
  }
  on host2 {
    address sci 192.168.106.2:7790;
    meta-disk internal;
    disk      /dev/db_data_vg/db_data_lv;
  }
}
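For completeness: we have not set any of the net-section timeout options, so
DRBD's defaults apply. As I understand them, these are the knobs that control
how quickly DRBD declares a peer dead and drops the connection instead of
blocking I/O — the values below are illustrative only, not our settings or a
recommendation, and I am not sure how they interact with the SuperSockets
transport:

```
resource r0 {
  net {
    timeout      60;   # 0.1 s units; peer considered dead after 6 s without an ack
    ping-int     10;   # seconds between keep-alive pings on an idle link
    ping-timeout  5;   # 0.1 s units; how long to wait for a ping reply
    ko-count      4;   # drop the peer after this many timed-out write requests
  }
}
```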

Thanks,
Sean Foley


_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
