Hi aTTi,

  Comments in-line;

On 18/09/14 02:22 PM, aTTi wrote:
Hi Digimer!

Thanks your answer. I had a lot of questions and not just for Digimer - for all.

So, if I had just 2 nodes with disabled quorum and I use fencing (aka
STONITH) + pacemaker, it will be safe for production use? (other
recommended settings what is not default? any howto?)

"Production ready" requires many things. Fencing is one of those things, of course, but there are others.

Details are hard to give without a better idea of your environment... What operating system? What versions of corosync, pacemaker and DRBD? etc.

With 2-node clusters, you need to put a fence delay against one node (so that node wins if both try to fence each other at the same moment), and you need to be careful to avoid fence loops. To avoid loops, either don't let the cluster stack start on boot (always my recommendation), or at least use wait_for_all if you have corosync v2+.

See:

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Giving_Nodes_More_Time_to_Start_and_Avoiding_.22Fence_Loops.22
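
To make the wait_for_all idea concrete, here is a minimal Python sketch. It is a toy model of the behaviour, not corosync's code; the class and method names are made up for illustration, and it assumes a 2-node cluster where a lone node is normally allowed to hold quorum.

```python
class TwoNodeQuorum:
    """Toy model of 2-node quorum with an optional wait_for_all rule."""

    def __init__(self, wait_for_all):
        self.wait_for_all = wait_for_all
        # Becomes True once both nodes have been visible at the same time.
        self.seen_all = False

    def update(self, peer_visible):
        """Record the current membership and return True if this node is quorate."""
        if peer_visible:
            self.seen_all = True
        if self.wait_for_all and not self.seen_all:
            # Freshly started and never saw the peer: stay inquorate.
            return False
        # 2-node special case: a single node may hold quorum on its own.
        return True


fresh_boot = TwoNodeQuorum(wait_for_all=True)
print(fresh_boot.update(peer_visible=False))  # alone after boot
print(fresh_boot.update(peer_visible=True))   # peer joined
print(fresh_boot.update(peer_visible=False))  # peer lost later
```

In this model a node that boots alone stays inquorate until it has seen its peer at least once, so it does not go on to fence the peer, which is what breaks the loop. Once both nodes have been seen together, losing the peer later does not cost the survivor its quorum.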

If the STONITH kills the slower node, it not makes data loss for
slower server? It's a remote shutdown or power off / reset ? Or same
as I start a shutdown as root?

With DRBD, both nodes stop writing when the connection is lost. This way, when the slower node is powered off, no data is lost. If your OS itself uses a journaled file system and you're not doing something silly like using hardware RAID in write-back mode without a BBU, then the OS should be safe as well.

When the fenced server boots back up, DRBD on the surviving node will know exactly which blocks changed while the peer was gone, so it only has to copy that data to bring the peer back up to a fully synced state.
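
As a rough illustration of why the resync is cheap, here is a small Python sketch of the "remember which blocks changed" idea. It is not DRBD's implementation; the `Replica` and `Primary` classes and the use of a set to track changed blocks are assumptions made purely for the example.

```python
class Replica:
    """A block device modelled as a list of byte strings."""

    def __init__(self, n_blocks):
        self.blocks = [b""] * n_blocks


class Primary(Replica):
    """Surviving node: tracks which blocks changed while the peer was gone."""

    def __init__(self, n_blocks):
        super().__init__(n_blocks)
        self.dirty = set()

    def write(self, index, data, peer=None):
        self.blocks[index] = data
        if peer is not None:
            # Peer connected: replicate the write immediately.
            peer.blocks[index] = data
        else:
            # Peer gone: only note that this block is now out of sync.
            self.dirty.add(index)

    def resync(self, peer):
        """Copy only the out-of-sync blocks; return their indexes."""
        copied = sorted(self.dirty)
        for index in copied:
            peer.blocks[index] = self.blocks[index]
        self.dirty.clear()
        return copied


node1, node2 = Primary(8), Replica(8)
node1.write(0, b"a", peer=node2)   # replicated normally
node1.write(3, b"b")               # node2 is fenced / offline
node1.write(5, b"c")
print(node1.resync(node2))         # only blocks 3 and 5 are copied
```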

So, if communication will break, happenings will be same in a western
movie: faster kills the slower and only 1 will alive. Both node will
die - it can be happen?

It can happen that both nodes die in some cases. This can be avoided with a few precautions: disable acpid if you have IPMI fencing (so the node powers off immediately instead of attempting a graceful shutdown), and set a delay against one node.

Please read the section immediately below the example config file here:

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_the_Fence_Devices
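
To show why the delay matters, here is a simplified Python model of the "shoot-out". It is only a sketch under stated assumptions: both nodes call a fence at the same instant, every fence action takes the same time, and a node that has been powered off never completes its own pending fence call. The function name and the `fence_time` value are made up for the example.

```python
def fence_race(delay_against, fence_time=2.0):
    """Return the set of nodes still running after a simultaneous fence.

    delay_against maps a node name to the number of seconds its peer
    must wait before fencing it (nodes not listed have no delay).
    """
    nodes = ("node1", "node2")
    # Moment at which each node would be powered off by its peer.
    death_time = {n: delay_against.get(n, 0.0) + fence_time for n in nodes}
    earliest = min(death_time.values())
    # Whoever is hit first dies; its own pending fence call dies with it.
    dead = {n for n in nodes if death_time[n] == earliest}
    return set(nodes) - dead


print(fence_race({}))               # no delay: both hit at the same time
print(fence_race({"node1": 15.0}))  # node1 is protected by a 15 s delay
```

With no delay, both nodes are hit at the same moment and nothing survives. With a delay against node1, node2 is powered off first and node1 is the one left running.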

With good setup and with no hardware error what is the most problems
with DRBD? How can I proof that?

With good fencing, there are no problems. I have used it in production since 2009 on dozens of 2-node clusters all over North America. The trick is the good fencing.

How can I find a documentation about DRBD test cases? Or recommended
configurations and installation manual for 2 node with Centos 7?

I don't know how much documentation exists for CentOS 7, as it is very new. However, the concepts in CentOS 6 are very similar.

You can read a lot about the logic and concepts behind how we use DRBD in our 2-node clusters here:

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Hooking_DRBD_into_the_Cluster.27s_Fencing

Example situation:
server 1 = DRBD active node with running services, server 2  = DRBD passive node
server 1 had hardware error, went offline, server 2 will the active node
server 2 set the virtual IP what needed for active, then starting services
after server 1 hardware repair, server 1 will online again
how can I switch back the most safest way if STONITH installed to
server 1 be the active and server 2 be the passive node? I need a
script? Or just few commands?

As soon as there is a problem, both nodes block and call a fence. The faster node powers off the slower node, gets confirmation that it is off, and *then* begins recovery. Maybe the fenced node will boot back up, or maybe it's a pile of rubble and will never power on again... it doesn't matter to the cluster.

Once the node is gone, the surviving machine will review the pacemaker configuration, determine what has to be done to recover your services, and then do that. What "that" means will depend entirely on your configuration.

An example might be to:

1. promote DRBD to primary
2. mount the file system on the DRBD device
3. start a service like httpd or postgresql that uses the DRBD data
4. take over the virtual IP address

This is just an example though.
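
The important part of a sequence like that is the ordering: each step only makes sense once the one before it has succeeded. Pacemaker handles this for you through your configuration, but the idea can be sketched in a few lines of Python. Everything here is hypothetical; the step names mirror the list above and the actions are placeholders that simply report success.

```python
def start_in_order(steps):
    """Run (name, action) pairs in order; stop at the first failure.

    Each action is a callable returning True on success.
    Returns the names of the steps that were started successfully.
    """
    started = []
    for name, action in steps:
        if not action():
            break  # later steps depend on this one, so do not continue
        started.append(name)
    return started


# Placeholder actions; a real cluster would run resource agents here.
recovery = [
    ("promote DRBD to primary", lambda: True),
    ("mount the file system", lambda: True),
    ("start the service", lambda: True),
    ("take over the virtual IP", lambda: True),
]
print(start_in_order(recovery))
```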

Other situation:
Any real life experience about to periodically (weekly, monthly)
change the active and passive nodes? Like in the last example, server
1 active, server 2 passive, then monthly I change to be active the
node 2. In January the active server 1 the active node, in February
the server 2 is the active, in March again the server 1 will the
active... for same server wear/abrasion.

Migration of services can be controlled however you want, but time-based migrations are not something I have seen. Nothing stops you from manually moving the services though, if you want. Generally though, services migrate in reaction to a specific event, like a component failure.

You recommend me to use 3. node as backup node or not? And in what way
to use the third node? As stacked node? Or ISCSI sync? Or normal
passive node? (I don't want it. I want to be my DRBD solution simple
and safe.)

A cluster does _NOT_ replace backups. You still need backups, always. Generally, I have a dedicated machine, in another building, that periodically rsync's the production data into a date-coded directory. This way, I can go back in time to retrieve deleted or corrupted files.

How you set up your backups, though, is entirely up to you. Backup is very different from HA.
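
As one possible shape for the date-coded rsync approach, here is a small Python sketch that only builds the command line and does not run anything. The host name, paths and function name are placeholders, and the use of rsync's --link-dest option (to hard-link files unchanged since the previous run) is my assumption for the example, not a description of any particular setup.

```python
import datetime


def backup_command(source, backup_root, day, previous_day=None):
    """Build an rsync command that copies source into a date-coded directory."""
    target = f"{backup_root}/{day.isoformat()}/"
    cmd = ["rsync", "-a"]
    if previous_day is not None:
        # Assumption: hard-link unchanged files against the previous backup.
        cmd.append(f"--link-dest={backup_root}/{previous_day.isoformat()}/")
    cmd += [source, target]
    return cmd


today = datetime.date(2014, 9, 18)
yesterday = today - datetime.timedelta(days=1)
# "node1:/data/" and "/backups" are hypothetical names.
print(" ".join(backup_command("node1:/data/", "/backups", today, yesterday)))
```

Because every run lands in its own directory, an older copy of a deleted or corrupted file can be pulled from the directory for the day you want.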

Can I combine DRBD server pairs? Like server 1+2 is DRBD1 node 1+2,
and server 3+4 is DRBD2 node 1+2. Then adding to DRBD1 the server 3 or
4, and for DRBD2 adding 3. node the server 1 or 2? Any point of this?
Or to make more strange: adding DRBD1 node 3 storage space to DRBD2's
disk space?
I think it's not a good idea just I want to know. Also I had disk
space for that, just asking as theoretically.

I don't know if it is possible, but I think it would be.

If DRBD really safe with 2 nodes, I don't want use more nodes. I will
make auto backup from data, I just want HA and no service stop and no
data loss if server error. I know, DRBD just one part of HA solution,
but it's important part.

As I said, I have used DRBD only in 2-node clusters, for several years, without any issue.

You recommend to use at least 2 ring level with corosync? level 1 =
crossover cable, level 2 = switch connection. Any disadvantages of
that?

It's up to you. I use active/passive bonding with the network links spanning two switches for full network redundancy. Redundant rings are good, too. I go with bonding only because it protects all traffic, including DRBD traffic.

Thank you again for your help.
aTTi

Always happy to help.

PS - Please keep replies on the mailing list. Conversations like this can help others in the future when they are in the archives.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
_______________________________________________
drbd-user mailing list
http://lists.linbit.com/mailman/listinfo/drbd-user