Hi Steve,
when the pre-promote notification is called, Pacemaker has already selected a node to promote; you cannot stop that sequence any more.
Since the node with the highest score should have been promoted, this code is there to fail slaves that come up after the promotion node has already been selected.
If you want to force another node to be promoted, the running master must first be demoted (and is then unavailable). Making this part of the resource agent can be dangerous, because you do not know what state a slave newly joining the cluster is coming from.
If you start all nodes at the same time and the node with the latest xlog does not become master, then your settings for the monitor interval and xlog_check_count are too small.
The master promotion score is only set after slave monitor interval * xlog_check_count, counted from the second the first slave resource comes up.
Since it is no longer predictable which server comes up how fast and when the first slave instance appears, this time should not be smaller than 30 seconds.
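For illustration, a rough configuration sketch (crm shell syntax; paths, node names and addresses are placeholders, not taken from any real setup) of how the slave monitor interval and xlog_check_count interact:

    primitive pgsql ocf:heartbeat:pgsql \
        params pgctl="/usr/pgsql-9.2/bin/pg_ctl" psql="/usr/pgsql-9.2/bin/psql" \
            rep_mode="sync" node_list="node1 node2" master_ip="10.0.0.10" \
            xlog_check_count="3" \
        op monitor interval="10s" \
        op monitor interval="9s" role="Master"
    ms ms_pgsql pgsql meta master-max="1" clone-max="2" notify="true"

With a 10s slave monitor interval and xlog_check_count=3, the promotion scores only settle about 30 seconds after the first slave monitor, which matches the lower bound above.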
Rainer
 
Sent: Thursday, 28 March 2013 at 12:41
From: "Steven Bambling" <smbambl...@arin.net>
To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
Subject: Re: [Pacemaker] OCF Resource agent promote question
I'm reading the additions you made to the pgsql resource agent to allow for streaming replication in Postgres 9.1+. I'm trying to determine whether your resource agent will compensate if the promoted node (the new master) does not have the newest data.
 
From the looks of the pgsql_pre_promote function it seems that it will just fail other replicas (slaves) that have newer data, but will continue with the promotion of the new master even though it does not have the latest data. 
 
If this is correct is there a way to force the promotion of the node with the newest data?
 
v/r
 
STEVE
 
 
On Mar 26, 2013, at 8:19 AM, Steven Bambling <smbambl...@arin.net> wrote:
 
Excellent, thanks so much for the clarification. I'll drop this new RA in and see if I can get things working.
 
STEVE
 
 
On Mar 26, 2013, at 7:38 AM, Rainer Brestan <rainer.bres...@gmx.net> wrote:
 
 
Hi Steve,
The pgsql RA does the same: it compares the last_xlog_replay_location of all nodes for master promotion.
Doing the promote as a restart instead of a promote command, to preserve the timeline ID, is also a configurable option (restart_on_promote) of the current RA.
And the RA is definitely capable of handling more than two instances. It goes through the node_list parameter and performs its actions for every member of the node list.
Originally it may have been planned to have only one slave, but the current implementation does not have this limitation. It has code for synchronous replication with more than two nodes, and when some of them fall back to async it does not promote them.
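For reference, the replay position the RA compares can be checked by hand on each standby with something like the following (a sketch only; node names and the database user are placeholders):

    for n in node1 node2; do
        echo -n "$n: "
        psql -h "$n" -U postgres -Atc "SELECT pg_last_xlog_replay_location();"
    done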
 
Of course I will share the extensions with the community once they are ready for use. And the feature of having more than two instances is not removed. I am not running more than two instances on one site; the current usage is two instances per site, with two sites, and the master managed by booth. But it is also under discussion to have more than two instances on one site, simply to avoid any availability interruption when one server is down and the other promotes with a restart.
The implementation is nearly finished; then the stress tests of failure scenarios begin.
 
Rainer
Sent: Tuesday, 26 March 2013 at 11:55
From: "Steven Bambling" <smbambl...@arin.net>
To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
Subject: Re: [Pacemaker] OCF Resource agent promote question
 
On Mar 26, 2013, at 6:32 AM, Rainer Brestan <rainer.bres...@gmx.net> wrote:
 
 
Hi Steve,
when Pacemaker performs a promotion, it has already selected a specific node to become master.
It is far too late at that point to try to update master scores.
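Master preference has to be adjusted earlier, typically from the monitor action, for example with crm_master. A generic sketch (not the actual pgsql RA code; the score values are arbitrary):

    # run from inside the RA's monitor action on each slave
    crm_master -l reboot -v 1000   # node considered most up to date
    crm_master -l reboot -v 100    # node considered behind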
 
But there is another problem with xlog in PostgreSQL.
 
According to some discussion on the PostgreSQL mailing lists, irrelevant xlog entries do not go into the xlog counter during redo and/or startup. This is especially true for CHECKPOINT xlog records, where the situation can easily be reproduced.
This can lead to a situation where replication is up to date but the slave shows a lower xlog value.
This issue was solved in 9.2.3, where the WAL receiver always counts the end of applied records.
 
We are currently testing with 9.2.3. I'm using the functions from http://www.databasesoup.com/2012/10/determining-furthest-ahead-replica.html, along with a tweaked function that returns the replay_lag in bytes, to get a more accurate measurement.
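For what it's worth, on 9.2 a byte value can also be read directly on a standby with pg_xlog_location_diff(); a rough sketch (this compares received vs. replayed WAL on the standby, which is not the same as lag against the master):

    psql -Atc "SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
                                            pg_last_xlog_replay_location());"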
 
There is also a second annoying issue. The timeline change is replicated to the slaves, but they do not save it anywhere. If a slave starts up again and does not have access to the WAL archive, it cannot start any more. This was also addressed as a patch in the 9.2 branch, but I haven't tested whether it is fixed in 9.2.3.

After talking with one of the Postgres guys, it was recommended that we look at an alternative to the built-in trigger file, which makes the master jump to a new timeline. Instead we move recovery.conf to recovery.done via the resource agent and then restart the postgresql service on the "new" master, so that it keeps the original timeline that the slaves will recognize.
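In shell terms that step looks roughly like this (paths are placeholders, not the actual resource agent code):

    # promote by restart: drop out of recovery while keeping the timeline
    PGDATA=/var/lib/pgsql/9.2/data
    mv "$PGDATA/recovery.conf" "$PGDATA/recovery.done"
    pg_ctl -D "$PGDATA" restart -m fast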
 
For data replication, no matter whether it is PostgreSQL or any other database, you always have two choices of operation.
- Data consistency is the topmost priority: do not go into operation unless everything is fine.
- Availability is the topmost priority: always try to have at least one running instance, even if its data might not be the latest.
 
The current pgsql RA does quite a good job for the first choice.
 
It currently has some limitations.
- After a switchover, whether manual or automatic, it needs some work from maintenance personnel.
- Some failure scenarios (series of faults) lead to no master existing without manual work.
- Geo-redundant replication with a multi-site cluster ticket system (booth) does not work.
- If availability or unattended operation is the priority, it cannot be used out of the box.
 
But it has a very good structure to be extended for other needs.
 
And this is what I am currently implementing:
extending the RA to support both choices of operation and preparing it for a multi-site cluster ticket system.
 
Would you be willing to share your extended RA? Also, do you run a cluster with more than 2 nodes?
 
v/r
 
STEVE
 
 
 
Regards, Rainer
Sent: Tuesday, 26 March 2013 at 00:01
From: "Andreas Kurz" <andr...@hastexo.com>
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] OCF Resource agent promote question
Hi Steve,

On 2013-03-25 18:44, Steven Bambling wrote:
> All,
>
> I'm trying to work on an OCF resource agent that uses postgresql
> streaming replication. I'm running into a few issues that I hope might
> be answered or at least some pointers given to steer me in the right
> direction.

Why are you not using the existing pgsql RA? It is capable of doing
synchronous and asynchronous replication and it is known to work fine.

Best regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now


>
> 1. A quick way of obtaining a list of "Online" nodes in the cluster
> that a resource will be able to migrate to. I've accomplished it with
> some grep and sed but it's not pretty or fast.
>
> # time pcs status | grep Online | sed -e "s/.*\[\(.*\)\]/\1/" | sed 's/ //'
> p1.example.net p2.example.net
>
> real    0m2.797s
> user    0m0.084s
> sys     0m0.024s
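A quicker way to list the members of the current partition, without parsing pcs output, might be crm_node (an aside added here for illustration):

    crm_node -p   # prints the names of the online nodes in this partition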
>
> Once I get a list of active/online nodes in the cluster my thinking was
> to use psql to get the current xlog location and lag of each of the
> remaining nodes and compare them. If the node has a greater log
> position and/or less lag it will be given a greater master preference.
>
> 2. How to force a monitor/probe before a promote is run on ALL nodes to
> make sure that the master preference is up to date before
> migrating/failing over the resource.
> - I was thinking that maybe during the promote call it could get the log
> location and lag from each of the nodes via a psql call (like above)
> and then force the resource to a specific node. Is there a way to do
> this and does it sound like a sane idea?
>
>
> The start of my RA is located here; suggestions and comments are 100%
> welcome https://github.com/smbambling/pgsqlsr/blob/master/pgsqlsr
>
> v/r
>
> STEVE
>
>