On 10/12/2011 02:34 PM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
Digimer,

Thanks again for holding my hand on this. I've already started reading your 
wiki posts. I wish Google gave your site a better ranking. I've been doing 
research for months, and your articles (especially comments in the config 
files) are very helpful.

Happy it helps! Linking back to it might help. ;)

Also note that I had each leg of the bond routed through a different switch. 
I had tried stacking the switches (hence the ability to LAG) but ran into issues 
there as well. So now for HA networking I use two independent switches, with a 
simple uplink between them, and mode=1. This configuration has proven very 
reliable for me.
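
For reference, a mode=1 (active-backup) bond on RHEL/CentOS looks roughly like the 
sketch below; the interface names, IP and timer values are just placeholders, so 
adjust to your own hardware:

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0"
  BOOTPROTO=none
  ONBOOT=yes
  IPADDR=10.20.0.1
  NETMASK=255.255.255.0

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  BOOTPROTO=none
  ONBOOT=yes

With each slave cabled to a different switch and the switches uplinked, a failed 
switch simply triggers a failover to the surviving path.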

I am using a single M4900 switch due to project budget issues right now. Once 
we go further toward production I intend to use two stacked M4900 switches. For 
now LACP hasn't been a problem. I will test with stacked M4900s and get back to 
you with my results.

Consider the possibility that you might one day want/need Red Hat support. In such a case, not using mode=1 will be a barrier. Obviously your build is to your spec, but do please carefully consider mode=1 before going into production.

Fencing is handled entirely within the cluster (cluster.conf). I use Lon's 
"obliterate-peer.sh" script as the DRBD fence-handler. When DRBD sees a split-brain, 
it blocks (with 'resource-and-stonith'), calls 'fence_node <victim>', and waits for a
successful return. The result is that, on fault, the node gets fenced twice (once from the DRBD 
call, once from the cluster itself) but it works just fine.
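
In drbd.conf terms the wiring is roughly the following (the path to the script is 
wherever you installed it; treat this as a sketch, not a drop-in config):

  resource r0 {
    disk {
      fencing resource-and-stonith;
    }
    handlers {
      # called when DRBD needs to fence its peer; I/O blocks until it returns success
      fence-peer "/sbin/obliterate-peer.sh";
    }
    ...
  }

The handler just wraps 'fence_node' against the other node, so the cluster's own 
fence devices do the actual work.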

Great explanation. Thanks!

If you are using IPMI (or other out-of-band BMC), be sure to also setup a 
switched PDU as a backup fence device (like an APC AP7900). Without this backup 
fencing method, your cluster will hang if a node loses power entirely, because 
the survivor will not be able to talk to the IPMI interface to set/confirm the 
node state.
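
In cluster.conf that ordering looks something like the snippet below; the names, 
addresses and credentials are placeholders, and the point is simply that the IPMI 
method is tried first and the PDU second:

  <clusternode name="an-node01" nodeid="1">
    <fence>
      <method name="ipmi">
        <device name="ipmi_an01" action="reboot"/>
      </method>
      <method name="pdu">
        <device name="pdu1" port="1" action="reboot"/>
      </method>
    </fence>
  </clusternode>

  <fencedevices>
    <fencedevice agent="fence_ipmilan" name="ipmi_an01" ipaddr="10.20.1.1" login="admin" passwd="secret"/>
    <fencedevice agent="fence_apc" name="pdu1" ipaddr="10.20.2.1" login="apc" passwd="secret"/>
  </fencedevices>

fenced works down the methods in order, so the PDU only gets called if the IPMI 
fence fails (for example, because the node lost all power).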

We are in an enterprise datacenter with two PDUs per rack, UPS, and generators. 
Also, the servers have two power supplies. So, I don't envision a power 
failure. The PDUs are owned and controlled by the infrastructure team, so IPMI 
is my only choice.

I've seen faults in the mainboard, in the cable running from the PSU to the mainboard, and other such failures take out a server. Simply assuming the power will never fail is unwise. Deciding you can live with the risk of a hung cluster, however, is a choice you can make.

As for infrastructure restrictions; I deal with this by bringing in two of my own PDUs and running one off of either mains source (be it from another PDU, UPS or mains directly). Then I can configure and use the PDUs however I wish.

I've done the following DD tests:

1. Non-replicated DRBD volume with no FS

You mean StandAlone/Primary?

Yes.

2. Replicated DRBD volume with no FS

So Primary/Primary?

Yes.

3. Replicated DRBD volume with GFS2 mounted locally

How else would you mount it?

See below.

4. Replicated DRBD volume with GFS2 mounted over GNBD

No input here, sorry.

See below.

5. Replicated DRBD volume with GFS2 mounted over iSCSI (IET)

Where does iSCSI fit into this? Are you making the DRBD resource a tgtd target 
and connecting to it locally or something?

In #1 and #2, I used "dd if=/dev/zero of=/dev/drbd0 oflag=direct bs=512K 
count=1000000". Results were great (almost the same as writing directly to /dev/sdb, 
which is the backing store to DRBD).

In #3, I used "mount -t gfs2 /dev/drbd0 /mnt" and then "dd if=/dev/zero 
of=/mnt/512K-testfile oflag=direct bs=512K count=1000000". Results were almost equally great 
(trivial performance loss).

In #4 and #5, I used my two DRBD boxes as storage servers and exported the DRBD volume via GNBD and 
iSCSI, respectively. I then connected a 3rd node (via same 10GbE equipment) and imported the 
volumes onto said 3rd node (again via GNBD and iSCSI, respectively). I set up round-robin 
multipath, and then mounted them using "mount -t gfs2 /dev/mpath/mpath1 /mnt". Then I ran 
"dd if=/dev/zero of=/mnt/512K-testfile oflag=direct bs=512K count=1000000". Results were 
horrible (less than 50% of the throughput seen in #1-3).
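
For the iSCSI leg, the IET export on each storage node was essentially just the raw 
DRBD device as a block-IO LUN, something like this (the IQN and addresses here are 
made up):

  # /etc/ietd.conf on each DRBD node
  Target iqn.2011-10.local.cluster:drbd0
      Lun 0 Path=/dev/drbd0,Type=blockio

  # on the 3rd node, discover both portals and log in; multipath then
  # bundles the two sessions into one round-robin device
  iscsiadm -m discovery -t sendtargets -p <node1-ip>
  iscsiadm -m discovery -t sendtargets -p <node2-ip>
  iscsiadm -m node -L all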


So my setup looks like this:

DRBD (pri/pri)->gfs2->gnbd->multipath->mount.gfs2

I skipped clvmd because I do not need any of the features of LVM. My RAID 
volume is 4.8TB. We will replace the equipment in 3 years, and even the most 
aggressive estimates have us using at most 2.4TB within that time.


Thanks,
Mike

Simplify the remote mount test... export the raw DRBD over iSCSI over a simple, non-redundant 10Gbit link. Mount the raw space as a simple ext3 partition and test again. If that tests well, start putting the pieces back one at a time. If it tests badly, look at your network config.
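
Roughly, something like this on the initiator side, assuming the exported LUN shows 
up as /dev/sdc (it may well not; check dmesg or 'iscsiadm -m session' first):

  mkfs.ext3 /dev/sdc
  mount /dev/sdc /mnt
  dd if=/dev/zero of=/mnt/512K-testfile oflag=direct bs=512K count=1000000

Same dd you used before, so the numbers are directly comparable; the only variables 
left are the network and the iSCSI target.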

As an aside, I've not used multipath because of warnings I got from others. That leaves me in a position where I can't rightly say why you shouldn't use it, but I'd try it without it first.

Simple Simple Simple. Get it working, start layering it up. :)

--
Digimer
E-Mail:              [email protected]
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
