Hi aTTi,
Comments in-line:
On 18/09/14 02:22 PM, aTTi wrote:
Hi Digimer!
Thanks for your answer. I have a lot of questions, and not just for Digimer: they are for everyone.
So, if I have just 2 nodes with quorum disabled and I use fencing
(aka STONITH) with Pacemaker, will it be safe for production use? (Are
there other recommended non-default settings? Any how-to?)
"Production ready" requires many things. Fencing is one of those things,
of course, but there are others.
Details are hard to give without a better idea of your environment...
What operating system? What versions of corosync, pacemaker and DRBD? etc.
With 2-node clusters, you need to put a fence delay on one node, and you
need to be careful to avoid fence loops. That is to say, either don't let
the cluster stack start on boot (always my recommendation), or at least
use wait_for_all if you have corosync v2+.
See:
https://alteeve.ca/w/AN!Cluster_Tutorial_2#Giving_Nodes_More_Time_to_Start_and_Avoiding_.22Fence_Loops.22
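To make the wait_for_all idea concrete, here is a toy Python model of it. This is only a sketch of the concept under simplifying assumptions, not corosync's implementation; the class and method names are hypothetical. A freshly started node is not considered quorate until it has seen every expected node online at the same time, so a node that was just fenced and rebooted cannot come up alone and fence its healthy peer.

```python
class WaitForAllGate:
    """Toy model of the wait_for_all idea (hypothetical names, not
    corosync code): after startup, quorum is withheld until every
    expected node has been seen online at the same time once."""

    def __init__(self, expected_nodes):
        self.expected = set(expected_nodes)
        self.all_seen_once = False

    def update_membership(self, online_nodes):
        """Record the current membership; return True if quorate."""
        if not self.all_seen_once and self.expected <= set(online_nodes):
            self.all_seen_once = True
        # Until all nodes have been seen together, never claim quorum,
        # so a node booting alone cannot go on to fence its peer.
        return self.all_seen_once


# node2 was just fenced and has rebooted; node1 is not visible yet.
gate = WaitForAllGate(["node1", "node2"])
print(gate.update_membership(["node2"]))           # False: keep waiting
print(gate.update_membership(["node1", "node2"]))  # True
print(gate.update_membership(["node2"]))           # True: peer lost later
```

In this toy model the gate only applies at startup; once both nodes have been seen together, losing the peer later is left to the normal fencing logic.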
If STONITH kills the slower node, won't that cause data loss on the
slower server? Is it a remote shutdown, or a power off / reset? Or is it
the same as running shutdown as root?
With DRBD, both nodes stop writing when the connection is lost. This way,
when the slower node is powered off, no data is lost. If your OS itself
uses a journaled file system and you're not doing something silly like
using hardware RAID in write-back mode without a BBU, then the OS
should be safe as well.
When the fenced server boots back up, DRBD on the surviving node will
know exactly which blocks changed while the peer was gone, so it only
has to copy that data to bring the peer back up to full sync state.
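The resync behaviour can be illustrated with a toy Python model. It is a simplified sketch, not DRBD's implementation: DRBD tracks out-of-sync regions in an on-disk bitmap, whereas here a plain set of block numbers stands in for it, and the class name is hypothetical.

```python
class ToyReplicatedDisk:
    """Toy model (not DRBD code) of resync after a disconnect: while
    the peer is gone, written block numbers are remembered, and on
    reconnect only those blocks are copied to the peer."""

    def __init__(self, num_blocks):
        self.local = [b""] * num_blocks
        self.peer = [b""] * num_blocks
        self.connected = True
        self.out_of_sync = set()  # block numbers the peer is missing

    def write(self, block, data):
        self.local[block] = data
        if self.connected:
            self.peer[block] = data      # replicated immediately
        else:
            self.out_of_sync.add(block)  # remember it for the resync

    def disconnect(self):
        self.connected = False

    def reconnect(self):
        """Copy only the out-of-sync blocks; return how many were copied."""
        to_copy = sorted(self.out_of_sync)
        for block in to_copy:
            self.peer[block] = self.local[block]
        self.out_of_sync.clear()
        self.connected = True
        return len(to_copy)


disk = ToyReplicatedDisk(1000)
disk.write(1, b"a")
disk.disconnect()          # peer fenced
disk.write(5, b"b")
disk.write(7, b"c")
disk.write(5, b"d")        # rewriting a block does not add more work
print(disk.reconnect())    # 2: only blocks 5 and 7 are copied
```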
So, if communication breaks, it will be like a western movie: the
faster node kills the slower one and only one survives. Can it happen
that both nodes die?
In some cases, yes, both nodes can die. This can be avoided with a few
precautions: disable acpid if you use IPMI fencing, and set a delay
against one node.
Please read the section immediately below the example config file here:
https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_the_Fence_Devices
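A toy Python model of the fence race shows why the delay matters. This is a hypothetical sketch, not Pacemaker or fence agent code: each node waits its configured delay, then issues a power-off that takes `latency` seconds to take effect, and a node can only issue its fence while it is still powered on. With equal delays, both power-off commands are in flight before either lands, so both nodes die; giving one side a longer delay lets the other win cleanly. Leaving acpid enabled makes an IPMI power-off trigger a graceful shutdown instead, which in this model amounts to a much larger `latency`.

```python
def fence_race(delay_a, delay_b, latency):
    """Toy 2-node fence race (hypothetical, not Pacemaker code).

    delay_a / delay_b: seconds each node waits before fencing its peer.
    latency: seconds from issuing a power-off until the target is off.
    Returns the sorted list of surviving nodes.
    """
    issue_at = {"a": delay_a, "b": delay_b}
    # When each node would be powered off, if its peer fences it.
    killed_at = {"a": delay_b + latency, "b": delay_a + latency}
    survivors = {"a", "b"}
    for node, peer in (("a", "b"), ("b", "a")):
        # A node can only send its fence while it is still powered on.
        if issue_at[node] < killed_at[node]:
            survivors.discard(peer)
    return sorted(survivors)


print(fence_race(0, 0, 1))    # []: both fire at once, both die
print(fence_race(0, 15, 1))   # ['a']: b waits, a wins cleanly
print(fence_race(0, 15, 30))  # []: slow "graceful" power-off, both die
```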
With a good setup and no hardware errors, what are the most common
problems with DRBD? How can I prove that?
With good fencing, there are no problems. I have used it in production
since 2009 on dozens of 2-node clusters all over North America. The
trick is good fencing.
Where can I find documentation about DRBD test cases? Or recommended
configurations and an installation manual for a 2-node cluster on CentOS 7?
I don't know how much documentation exists for CentOS 7; it is very new.
However, the concepts in CentOS 6 are very similar.
You can read a lot about the logic and concepts behind how we use
DRBD in our 2-node clusters here:
https://alteeve.ca/w/AN!Cluster_Tutorial_2#Hooking_DRBD_into_the_Cluster.27s_Fencing
Example situation:
server 1 = DRBD active node with running services, server 2 = DRBD passive node
server 1 has a hardware error and goes offline; server 2 becomes the active node
server 2 takes over the virtual IP needed by the active node, then starts the services
after server 1's hardware is repaired, server 1 comes online again
With STONITH in place, what is the safest way to switch back so that
server 1 is the active node and server 2 the passive node again? Do I
need a script, or just a few commands?
As soon as there is a problem, both nodes block and call a fence. The
faster node powers off the slower node, gets confirmation that it is
off, and *then* begins recovery. Maybe the fenced node will boot back
up, or maybe it's a pile of rubble and will never power on again... it
doesn't matter to the cluster.
Once the node is gone, the surviving machine will review the pacemaker
configuration, determine what has to be done to recover your services,
and then do that. What "that" means will depend entirely on your
configuration.
An example might be to:
1. promote DRBD to primary
2. mount the file system on the DRBD device
3. start a service like httpd or postgresql that uses the DRBD data
4. take over the virtual IP address
This is just an example though.
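The ordering above can be sketched in Python as a toy model. This is hypothetical and not Pacemaker's API (in Pacemaker, this ordering is expressed with order and colocation constraints); the step names and callables are placeholders. The point it shows is that nothing happens until the fence is confirmed, and each step only runs if the previous one succeeded.

```python
def recover(fence_confirmed, steps):
    """Toy recovery sequence (hypothetical, not Pacemaker's API).

    fence_confirmed: callable returning True once the peer is confirmed off.
    steps: list of (name, callable) pairs; each callable returns True on
    success. Returns the names of the steps that completed.
    """
    if not fence_confirmed():
        return []  # block: never recover without fence confirmation
    done = []
    for name, action in steps:
        if not action():
            break  # later steps depend on this one, so stop here
        done.append(name)
    return done


# Placeholder plan mirroring the example list above.
steps = [
    ("promote DRBD to primary", lambda: True),
    ("mount file system", lambda: True),
    ("start service", lambda: True),
    ("take over virtual IP", lambda: True),
]
print(recover(lambda: False, steps))      # []: fence not confirmed
print(len(recover(lambda: True, steps)))  # 4
```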
Other situation:
Any real-life experience with periodically (weekly, monthly) swapping
the active and passive nodes? Like in the last example: server 1
active, server 2 passive, then each month I make node 2 the active
one. In January server 1 is the active node, in February server 2 is
active, in March server 1 is active again... to spread wear evenly
across the servers.
Migration of services can be controlled however you want, but
time-based migration is not something I have seen. Nothing stops you
from moving the services manually if you want to, though. Generally,
services migrate in reaction to a specific event, like a component
failure.
Do you recommend using a third node as a backup node or not? If so,
how should the third node be used? As a stacked node? With iSCSI sync?
Or as a normal passive node? (I'd rather not; I want my DRBD solution
to be simple and safe.)
A cluster does _NOT_ replace backups. You still need backups, always.
Generally, I have a dedicated machine, in another building, that
periodically rsyncs the production data into a date-coded directory.
This way, I can go back in time to retrieve deleted or corrupted files.
How you set up your backups, though, is entirely up to you. Backup is
very different from HA.
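A date-coded backup run like the one described above could be scripted roughly as follows. This is a minimal sketch assuming rsync is installed; the host name and paths are made-up examples, and the function only builds the command (pass the list to subprocess.run to actually execute it).

```python
import datetime
import shlex


def backup_command(source, backup_root, when):
    """Build (but do not run) an rsync command that copies `source`
    into a date-coded directory under `backup_root`.
    Host names and paths used below are hypothetical examples."""
    dest = f"{backup_root.rstrip('/')}/{when:%Y-%m-%d}/"
    # -a (archive mode) preserves permissions, times, symlinks, etc.
    return ["rsync", "-a", source, dest]


cmd = backup_command("prod:/srv/data/", "/backups", datetime.date(2014, 9, 18))
print(shlex.join(cmd))
# rsync -a prod:/srv/data/ /backups/2014-09-18/
```

Each run here makes a full copy; rsync's --link-dest option can hard-link unchanged files against an earlier day's directory if disk space on the backup machine matters.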
Can I combine DRBD server pairs? For example, servers 1+2 are DRBD1
nodes 1+2, and servers 3+4 are DRBD2 nodes 1+2. Then I add server 3 or
4 to DRBD1 as a third node, and server 1 or 2 to DRBD2 as a third node.
Is there any point to this? Or, stranger still: adding the storage
space of DRBD1's node 3 to DRBD2's disk space?
I don't think it's a good idea; I just want to know. I also have the
disk space for it; I'm only asking theoretically.
I don't know if it is possible, but I think it would be.
If DRBD is really safe with 2 nodes, I don't want to use more nodes. I
will make automatic backups of the data; I just want HA, with no service
interruption and no data loss if a server fails. I know DRBD is just one
part of an HA solution, but it's an important part.
As I said, I have used DRBD exclusively in 2-node clusters for several
years without any issues.
Do you recommend using at least 2 rings with corosync? Ring 1 =
crossover cable, ring 2 = switched connection. Any disadvantages to
that?
It's up to you. I use active/passive bonding with the network links
spanning two switches for full network redundancy. Redundant rings are
good, too. I go with bonding only because it protects all traffic,
including DRBD traffic.
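Active/passive bonding (Linux bonding's active-backup mode) can be pictured with this toy Python model. It is a hypothetical sketch, not the kernel driver: exactly one link carries traffic at a time, and when it fails the next working link takes over. The interface names are examples, with each link assumed to be cabled to a different switch. As modelled here, the bond stays on the new link when the original one comes back rather than failing back.

```python
def active_link(links, link_up, current=None):
    """Toy active-backup selection (hypothetical, not the bonding driver).

    links: interface names in preference order.
    link_up: dict mapping interface name -> True if the link is up.
    current: the link in use so far, if any.
    Returns the link that should carry traffic, or None if all are down.
    """
    if current is not None and link_up.get(current, False):
        return current  # keep using the current link while it is up
    for link in links:
        if link_up.get(link, False):
            return link  # fail over to the first working link
    return None  # every link is down


links = ["eth0", "eth1"]  # example names; one link per switch
cur = active_link(links, {"eth0": True, "eth1": True})
print(cur)  # eth0
cur = active_link(links, {"eth0": False, "eth1": True}, cur)
print(cur)  # eth1: switch A failed
cur = active_link(links, {"eth0": True, "eth1": True}, cur)
print(cur)  # eth1: no failback in this model
```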
Thank you again for your help.
aTTi
Always happy to help.
PS - Please keep replies on the mailing list. Conversations like this
can help others in the future when they are in the archives.
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user