Re: [DRBD-user] DRBD (XFS) + Pacemaker + Corosync with 2 node and arbiter (virtual node) for no split brain: Stonith, Quorum needed?

Digimer Fri, 19 Sep 2014 01:30:55 -0700

Hi aTTi,

  Comments in-line;


On 18/09/14 02:22 PM, aTTi wrote:

Hi Digimer!

Thanks for your answer. I have a lot of questions, and not just for Digimer but for everyone.

So, if I have just 2 nodes with quorum disabled and I use fencing (aka
STONITH) + pacemaker, will it be safe for production use? (Are there other
recommended settings that are not the default? Any howto?)

"Production ready" requires many things. Fencing is one of those things, of course, but there are others.

Details are hard to give without a better idea of your environment... What operating system? What versions of corosync, pacemaker and DRBD? etc.

With 2-node clusters, you need to put a delay on one node, and you need to be careful to avoid fence loops. That is to say, either don't let the cluster stack start on boot (always my recommendation), or at least use wait_for_all if you have corosync v2+.


See:

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Giving_Nodes_More_Time_to_Start_and_Avoiding_.22Fence_Loops.22
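
To illustrate why wait_for_all helps against fence loops, here is a minimal Python sketch of the rule. It is a toy model, not corosync code: the class and method names are made up for this example, and it assumes a 2-node setup where a single node is otherwise allowed to be quorate on its own.

```python
class TwoNodeQuorum:
    """Toy model of one node's quorum view in a 2-node cluster.

    Hypothetical names; this only mimics the idea behind wait_for_all.
    """

    def __init__(self, wait_for_all=True):
        self.wait_for_all = wait_for_all
        # Has this node seen its peer at least once since it started?
        self.seen_peer = False
        self.peer_visible = False

    def peer_joined(self):
        self.peer_visible = True
        self.seen_peer = True

    def peer_lost(self):
        self.peer_visible = False

    def is_quorate(self):
        # With wait_for_all, a freshly started node stays inquorate
        # until it has seen the peer once, so it will not fence a peer
        # it simply cannot reach right after booting.
        if self.wait_for_all and not self.seen_peer:
            return False
        # Assumption: in 2-node mode one node alone may hold quorum.
        return True


node = TwoNodeQuorum(wait_for_all=True)
print(node.is_quorate())   # fresh boot, peer never seen
node.peer_joined()
node.peer_lost()
print(node.is_quorate())   # peer was seen once, then lost
```

In this model a node that was fenced and boots back up with its network still broken never becomes quorate, so it cannot fence the survivor in return.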

If STONITH kills the slower node, doesn't that cause data loss on the
slower server? Is it a remote shutdown, or a power off / reset? Or is it
the same as starting a shutdown as root?

With DRBD, both nodes stop writing when the connection is lost. This way, when the slower node is powered off, no data is lost. If your OS itself uses a journaled file system and you're not doing something silly like using hardware RAID in write-back mode without a BBU, then the OS should be safe as well.

When the fenced server boots back up, DRBD on the surviving node will know just which blocks changed when the peer was gone, so it only has to copy that data to bring the peer back up to full sync state.
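
The idea of copying only the changed blocks can be sketched in a few lines of Python. This is a simplified model and not DRBD's actual implementation: the class name, the in-memory lists and the set of dirty block numbers are all assumptions made for the example.

```python
class ReplicatedDisk:
    """Toy model of a surviving node tracking blocks its peer missed."""

    def __init__(self, num_blocks):
        self.local = [b""] * num_blocks
        self.peer = [b""] * num_blocks
        self.connected = True
        self.dirty = set()  # block numbers written while the peer was gone

    def write(self, block, data):
        self.local[block] = data
        if self.connected:
            self.peer[block] = data   # replicated right away
        else:
            self.dirty.add(block)     # remember it for the later resync

    def disconnect(self):
        self.connected = False

    def reconnect(self):
        """Copy only the dirty blocks to the peer and return their numbers."""
        copied = sorted(self.dirty)
        for block in copied:
            self.peer[block] = self.local[block]
        self.dirty.clear()
        self.connected = True
        return copied


disk = ReplicatedDisk(num_blocks=8)
disk.write(0, b"a")      # peer is connected, so nothing is marked dirty
disk.disconnect()        # the peer was fenced
disk.write(3, b"b")
disk.write(5, b"c")
disk.write(3, b"d")      # rewriting a block does not add a second entry
print(disk.reconnect())  # only blocks 3 and 5 are copied
```

The resync cost in this sketch depends on how many distinct blocks changed during the outage, not on the size of the device.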

So, if communication breaks, it will be like a western movie: the
faster one kills the slower one and only 1 stays alive. Can it happen
that both nodes die?

It can happen that both nodes die in some cases. This can be avoided with a few precautions; disable acpid if you have IPMI fencing and set a delay against one node.


Please read the section immediately below the example config file here:

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_the_Fence_Devices
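
As a rough illustration of why both precautions matter, here is a small Python model of the fence race. It is only a sketch under stated assumptions: both nodes call a fence at time 0, a fence request that has already fired cannot be cancelled, and `kill_time` stands for how long the victim keeps running after the request fires (longer if it shuts down gracefully, very short if it powers off at once). The function and parameter names are hypothetical.

```python
def fence_race(delay_against, kill_time):
    """Return the set of nodes still alive after a mutual fence attempt.

    delay_against: maps each of the two node names to the seconds a
        fence action aimed at that node waits before it fires.
    kill_time: seconds the victim keeps running after the action fires.
    """
    node_a, node_b = sorted(delay_against)
    # The attack on the node with the smaller delay fires first.
    first, second = sorted((node_a, node_b), key=lambda n: delay_against[n])
    dead = {first}
    first_dies_at = delay_against[first] + kill_time
    # The node attacked first can still fire its own fence against
    # the other node if that action triggers before it is really off.
    if delay_against[second] < first_dies_at:
        dead.add(second)
    return {node_a, node_b} - dead


# No delay, slow graceful shutdown: both nodes can kill each other.
print(fence_race({"node1": 0, "node2": 0}, kill_time=4))
# 15 second delay against node1, fast power off: node1 survives.
print(fence_race({"node1": 15, "node2": 0}, kill_time=1))
```

In this model a delay only helps when it is longer than the time the losing node needs to actually go down, which is the reason for pairing the delay with an immediate power off.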

With a good setup and no hardware errors, what are the most common
problems with DRBD? How can I prove that?

With good fencing, there are no problems. I have used it in production since 2009 on dozens of 2-node clusters all over North America. The trick is the good fencing.

Where can I find documentation about DRBD test cases? Or recommended
configurations and an installation manual for 2 nodes with CentOS 7?

I don't know how much documentation exists for CentOS 7, it is very new. However, the concepts in CentOS 6 are very similar.

You can read a lot about the logic and concepts behind how we use DRBD in our 2-node clusters here:


https://alteeve.ca/w/AN!Cluster_Tutorial_2#Hooking_DRBD_into_the_Cluster.27s_Fencing

Example situation:
server 1 = DRBD active node with running services, server 2 = DRBD passive node
server 1 has a hardware error and goes offline, so server 2 becomes the active node
server 2 sets the virtual IP needed for the active role, then starts the services
after the hardware is repaired, server 1 comes online again
With STONITH installed, what is the safest way to switch back so that
server 1 is the active node and server 2 is the passive node? Do I need a
script? Or just a few commands?

As soon as there is a problem, both nodes block and call a fence. The faster node powers off the slower node, gets confirmation that it is off, and *then* begins recovery. Maybe the fenced node will boot back up, or maybe it's a pile of rubble and will never power on again... it doesn't matter to the cluster.

Once the node is gone, the surviving machine will review the pacemaker configuration, determine what has to be done to recover your services, and then do that. What "that" means will depend entirely on your configuration.


An example might be to:

1. promote DRBD to primary
2. mount the file system on drbd
3. start a service like httpd or postgresql that uses the DRBD data
4. take over the virtual IP address

This is just an example though.
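
The ordering in the example steps above can be sketched in Python. This is only an illustration of "each step depends on the one before it", not how pacemaker is implemented: the `recover` and `stop_order` helpers and the `run_step` callback are hypothetical names for this example.

```python
RECOVERY_STEPS = [
    "promote DRBD to primary",
    "mount the file system on drbd",
    "start the service that uses the DRBD data",
    "take over the virtual IP address",
]


def recover(steps, run_step):
    """Run the steps in order and stop at the first one that fails.

    run_step is a callable that takes a step name and returns True on
    success. Returns the list of steps that completed.
    """
    completed = []
    for step in steps:
        if not run_step(step):
            break  # later steps depend on this one, so do not continue
        completed.append(step)
    return completed


def stop_order(completed):
    """Stopping is done in the reverse order of starting."""
    return list(reversed(completed))


done = recover(RECOVERY_STEPS, run_step=lambda step: True)
print(done)
print(stop_order(done))
```

For example, if mounting the file system fails, the service and the virtual IP are never started, because they would have nothing to serve.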

Another situation:
Does anyone have real-life experience with periodically (weekly, monthly)
swapping the active and passive nodes? As in the last example, server
1 is active and server 2 is passive, then each month I switch the active
node. In January server 1 is the active node, in February
server 2 is active, in March server 1 is active
again... so that both servers wear evenly.

Migration of services can be controlled however you want, but time-based migration is not something I have seen. Nothing stops you from manually moving the services though, if you want. Generally though, services migrate in reaction to a specific event, like a component failure.

Do you recommend using a 3rd node as a backup node or not? And in what way
should the third node be used? As a stacked node? Or iSCSI sync? Or a normal
passive node? (I don't want that. I want my DRBD solution to be simple
and safe.)

A cluster does _NOT_ replace backups. You still need backups, always. Generally, I have a dedicated machine, in another building, that periodically rsync's the production data into a date-coded directory. This way, I can go back in time to retrieve deleted or corrupted files.

How you set up your backup though, is entirely up to you. Backup is very different from HA.
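
As one possible shape for the date-coded rsync approach described above, here is a short Python sketch that only builds the command line and does not run it. The host name, the paths and the helper name are placeholders, and the use of rsync's `--link-dest` option (which hard-links unchanged files against the previous day's copy to save space) is an assumption of this example, not something the setup above is known to use.

```python
import datetime


def build_backup_command(source, backup_root, today, previous=None):
    """Build an rsync command that copies source into a dated directory.

    source and backup_root are placeholder paths. previous is the date
    of an earlier backup to hard-link unchanged files against, if any.
    """
    destination = f"{backup_root}/{today.isoformat()}/"
    command = ["rsync", "-a"]  # -a: archive mode, keeps permissions and times
    if previous is not None:
        command.append(f"--link-dest={backup_root}/{previous.isoformat()}/")
    command += [source, destination]
    return command


today = datetime.date(2014, 9, 19)
yesterday = today - datetime.timedelta(days=1)
print(build_backup_command("node1:/data/", "/backups", today, yesterday))
```

Each run lands in its own directory named after the date, so an older version of a deleted or corrupted file can be fetched from the directory of the day before it went wrong.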

Can I combine DRBD server pairs? For example, servers 1+2 are DRBD1 nodes 1+2,
and servers 3+4 are DRBD2 nodes 1+2. Then I add server 3 or 4 to DRBD1,
and add server 1 or 2 as a 3rd node to DRBD2. Is there any point to this?
Or, to make it stranger: adding DRBD1 node 3's storage space to DRBD2's
disk space?
I think it's not a good idea, I just want to know. I also have the disk
space for that, I am just asking theoretically.


I don't know if it is possible, but I think it would be.

If DRBD is really safe with 2 nodes, I don't want to use more nodes. I will
make automatic backups of the data; I just want HA, with no service stop and no
data loss on a server error. I know DRBD is just one part of an HA solution,
but it's an important part.

As I said, I have used DRBD in 2-node clusters only for several years without any issue.

Do you recommend using at least 2 ring levels with corosync? Level 1 =
crossover cable, level 2 = switch connection. Are there any disadvantages
to that?

It's up to you. I use active/passive bonding with the network links spanning two switches for full network redundancy. Redundant rings are good, too. I go with bonding only because it protects all traffic, including DRBD traffic.
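
The behaviour of an active/passive bond can be modelled very simply. The Python sketch below is a toy illustration, not the Linux bonding driver: the interface names are placeholders, and it assumes the bond stays on the current link as long as that link is up rather than failing back.

```python
def select_active(current, link_up, order):
    """Pick the link an active/passive bond should carry traffic on.

    current: name of the link in use now, or None.
    link_up: maps link name to True if the link has carrier.
    order:   link names in order of preference for a fresh choice.
    """
    # Stay on the current link while it is healthy (no fail-back).
    if current is not None and link_up.get(current, False):
        return current
    # Otherwise take the first link that is up.
    for link in order:
        if link_up.get(link, False):
            return link
    return None  # both links down, the bond has no path


order = ["eth0", "eth1"]  # placeholder names, one link per switch
active = select_active(None, {"eth0": True, "eth1": True}, order)
print(active)
# The switch behind eth0 fails, so traffic moves to the other link.
active = select_active(active, {"eth0": False, "eth1": True}, order)
print(active)
```

Because every protocol on the bond follows the active link, cluster traffic and DRBD replication both survive the loss of one switch in this model.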

Thank you again for your help.
aTTi


Always happy to help.

PS - Please keep replies on the mailing list. Conversations like this can help others in the future when they are in the archives.


--
Digimer
Papers and Projects: https://alteeve.ca/w/

What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
