On 10/12/2011 02:34 PM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
Digimer,
Thanks again for holding my hand on this. I've already started reading your
wiki posts. I wish Google gave your site a better ranking. I've been doing
research for months, and your articles (especially comments in the config
files) are very helpful.
Happy it helps! Linking back to it might help. ;)
Also note that I had each leg of the bond routed through a different switch.
I had tried stacking them (hence the ability to LAG) but ran into issues there
as well. So now for HA networking I use two independent switches, with a simple
uplink between the switches, and mode=1. This configuration has proven very
reliable for me.
I am using a single M4900 switch due to project budget issues right now. Once
we go further toward production I intend to use two stacked M4900 switches. For
now LACP hasn't been a problem. I will test with stacked M4900s and get back to
you with my results.
Consider the possibility that you might one day want/need Red Hat
support. In such a case, not using mode=1 will be a barrier. Obviously
your build is to your spec, but do please carefully consider mode=1
before going into production.
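For reference, a minimal mode=1 (active-backup) bond on a RHEL-style box looks
roughly like the below, with each slave cabled to a different switch. The
interface names, miimon value and IP are only examples:

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BONDING_OPTS="mode=1 miimon=100"
  BOOTPROTO=none
  ONBOOT=yes
  IPADDR=10.20.0.1
  NETMASK=255.255.255.0

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  BOOTPROTO=none
  ONBOOT=yes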
Fencing is handled entirely within the cluster (cluster.conf). I use Lon's
"obliterate-peer.sh" script as the DRBD fence-handler. When DRBD sees a split-brain,
it blocks (with 'resource-and-stonith') and calls 'fence_node <victim>' and waits for a
successful return. The result is that, on fault, the node gets fenced twice (once from the DRBD
call, once from the cluster itself) but it works just fine.
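On the DRBD side, the hookup is just the fencing policy plus the fence-peer
handler. Roughly like this; the path to Lon's script depends on where you put
it:

  # in the resource (or common) section of drbd.conf
  disk {
          fencing resource-and-stonith;
  }
  handlers {
          # blocks I/O and fences the peer via fence_node
          fence-peer "/sbin/obliterate-peer.sh";
  }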
Great explanation. Thanks!
If you are using IPMI (or another out-of-band BMC), be sure to also set up a
switched PDU as a backup fence device (like an APC AP7900). Without this backup
fencing method, your cluster will hang if a node loses power entirely, because
the survivor will not be able to talk to the IPMI interface to set/confirm the
node's state.
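In cluster.conf that is simply two fence methods per node, tried in order.
Something like the below, with the IPs, ports and passwords obviously being
placeholders (second node omitted):

  <clusternode name="node1" nodeid="1">
          <fence>
                  <method name="ipmi">
                          <device name="ipmi_node1" action="reboot"/>
                  </method>
                  <method name="pdu">
                          <device name="pdu1" port="1" action="reboot"/>
                  </method>
          </fence>
  </clusternode>

  <fencedevices>
          <fencedevice name="ipmi_node1" agent="fence_ipmilan" ipaddr="10.20.1.1" login="admin" passwd="secret"/>
          <fencedevice name="pdu1" agent="fence_apc" ipaddr="10.20.2.1" login="apc" passwd="secret"/>
  </fencedevices>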
We are in an enterprise datacenter with two PDUs per rack, UPS, and generators.
Also, the servers have two power supplies. So, I don't envision a power
failure. The PDUs are owned and controlled by the infrastructure team, so IPMI
is my only choice.
I've seen mainboard faults, a faulty cable between the PSU and the mainboard,
and other such failures take out a server. Simply assuming the power will never
fail is unwise. Deciding you can live with the risk of a hung cluster, however,
is a choice you can make.
As for infrastructure restrictions: I deal with this by bringing in two of my
own PDUs and running each off a different mains source (be it another PDU, a
UPS or mains directly). Then I can configure and use the PDUs however I wish.
I've done the following dd tests:
1. Non-replicated DRBD volume with no FS
You mean StandAlone/Primary?
Yes.
2. Replicated DRBD volume with no FS
So Primary/Primary?
Yes.
3. Replicated DRBD volume with GFS2 mounted locally
How else would you mount it?
See below.
4. Replicated DRBD volume with GFS2 mounted over GNBD
No input here, sorry.
See below.
5. Replicated DRBD volume with GFS2 mounted over iSCSI (IET)
Where does iSCSI fit into this? Are you making the DRBD resource a tgtd target
and connecting to it locally or something?
In #1 and #2, I used "dd if=/dev/zero of=/dev/drbd0 oflag=direct bs=512K
count=1000000". Results were great (almost the same as writing directly to /dev/sdb,
which is the backing store to DRBD).
In #3, I used "mount -t gfs2 /dev/drbd0 /mnt" and then "dd if=/dev/zero
of=/mnt/512K-testfile oflag=direct bs=512K count=1000000". Results were almost equally great
(trivial performance loss).
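(For completeness, the GFS2 filesystem itself is made once beforehand; with the
DLM that is along the lines of

  mkfs.gfs2 -p lock_dlm -t <clustername>:<fsname> -j <number of mounting nodes> /dev/drbd0

with the cluster name matching cluster.conf.)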
In #4 and #5, I used my two DRBD boxes as storage servers and exported the DRBD volume via GNBD and
iSCSI, respectively. I then connected a 3rd node (via the same 10GbE equipment) and imported the
volumes onto said 3rd node (again via GNBD and iSCSI, respectively). I set up round-robin
multipath, and then mounted them using "mount -t gfs2 /dev/mpath/mpath1 /mnt". Then I ran
"dd if=/dev/zero of=/mnt/512K-testfile oflag=direct bs=512K count=1000000". Results were
horrible (not even 50% of the throughput seen in #1-3).
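For the iSCSI leg, the IET export was nothing fancier than a whole-device
target, along these lines (the IQN is just an example):

  # /etc/ietd.conf on each storage node
  Target iqn.2011-10.local.cluster:drbd0
          Lun 0 Path=/dev/drbd0,Type=blockio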
So my setup looks like this:
DRBD (pri/pri)->gfs2->gnbd->multipath->mount.gfs2
I skipped clvmd because I do not need any of the features of LVM. My RAID
volume is 4.8TB. We will replace the equipment in 3 years, and even the most
aggressive estimates have us using at most 2.4TB within those 3 years.
Thanks,
Mike
Simplify the remote mount test... export the raw DRBD over iSCSI over a
simple, non-redundant 10Gbit link. Mount the raw space as a simple ext3
partition and test again. If that tests well, start putting pieces back
one at a time. If it tests badly, look at your network config.
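Client side, that is only a handful of steps; something like the below, with
the IQN, the target IP and the device name it shows up as all being examples:

  # log in to the raw DRBD export from the test node
  iscsiadm -m discovery -t sendtargets -p 10.20.0.1
  iscsiadm -m node -T iqn.2011-10.local.cluster:drbd0 -p 10.20.0.1 --login

  # format it as plain ext3, mount it, and rerun the same dd test
  mkfs.ext3 /dev/sdc
  mount /dev/sdc /mnt
  dd if=/dev/zero of=/mnt/512K-testfile oflag=direct bs=512K count=1000000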
As an aside, I've not used multipath because of warnings I got from
others. This leaves me in a position where I can't rightly say why you
shouldn't use it, but I'd try it without it.
Simple Simple Simple. Get it working, start layering it up. :)
--
Digimer
E-Mail: [email protected]
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"