Re: [Linux-HA] pgsql OCF resource agent and other questions

Zoltan Boszormenyi Wed, 20 Feb 2008 03:13:49 -0800

Hi,

Zoltan Boszormenyi wrote:

Serge Dubrouski wrote:

On Feb 13, 2008 4:29 AM, Zoltan Boszormenyi <[EMAIL PROTECTED]> wrote:

Andrew Beekhof wrote:

On Feb 12, 2008, at 8:57 PM, Zoltan Boszormenyi wrote:

Andrew Beekhof wrote:

On Feb 12, 2008, at 4:59 PM, Zoltan Boszormenyi wrote:

Hi,

Serge Dubrouski wrote:

pgsql OCF RA doesn't support multistate configuration so I don't think
that creating a clone would be a good idea.

Thanks for the information.

Some other questions.

According to http://linux-ha.org/v2/faq/resource_too_active
the monitor action should return 0 for running, 7 ($OCF_NOT_RUNNING)
for downed resources and anything else for failed ones.
Either this documentation is buggy,

no

or heartbeat doesn't conform to its own docs.

also no

Here's the scenario: londiste creates a pidfile and deletes it when
it quits correctly.
However, if I kill it manually then the pidfile stays. What should
my script return
when it detects that the process with the indicated PID is no
longer there?
It's not a "downed" resource, it's a failed one. So I returned
$OCF_ERR_GENERIC.
But after some time heartbeat says that my resource became
"unmanaged".

i'm guessing (because you've not included anything on which to
comment properly) that the stop action failed

It shouldn't have failed, stop action always returns $OCF_SUCCESS.

In contrast to this, the pgsql OCF RA does it differently. It
always returns 7
when it finds that there's no postmaster process. Which is the
right behaviour?

it depends what you want to happen.
if you want a stop to be sent, use OCF_ERR_GENERIC.
if the resource is stateless and doesn't need any cleaning up, use
OCF_NOT_RUNNING
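
A minimal sketch of the distinction described above, for a daemon that
writes a pidfile (as londiste does). The function name pidfile_monitor and
its pidfile argument are hypothetical and not taken from any shipped agent;
the numeric return codes are the usual OCF values.

```shell
#!/bin/sh
# OCF return codes used by this sketch.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical monitor for a daemon that writes its PID to a pidfile.
# $1 = path to the pidfile
pidfile_monitor() {
    pidfile=$1

    # No pidfile: the daemon was stopped cleanly or never started.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING

    pid=$(head -n 1 "$pidfile")

    # Pidfile present and the process accepts signal 0: running.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi

    # Pidfile left behind by a process that is gone: report a failure,
    # so that the cluster sends a stop before recovering the resource.
    return $OCF_ERR_GENERIC
}
```

With this layout the stop action presumably has to remove the stale pidfile
itself, otherwise the next monitor call keeps reporting the same failure.
Note that kill -0 also fails when the caller lacks permission to signal the
process, so the sketch assumes the agent runs as root or as the daemon's owner.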

It's quite an important detail. Shouldn't this be documented at
http://linux-ha.org/OCFResourceAgent ?

yep.  but it's a wiki so anyone can do that :)

I see. It's an excuse because no one did it yet. :-)

Yesterday another problem popped up and I don't understand why
it didn't happen before. I upgraded to heartbeat 2.1.3 using the
SUSE build service packages at
http://download.opensuse.org/repositories/server:/ha-clustering/
but the problem seems to persist. I have two pgsql resources,
using the stock install on my Fedora 6, i.e. pgdata=/var/lib/pgsql/data.

Both are tied to their respective nodes, symmetric_cluster on,
the constraints' score is -INFINITY for running them on the wrong node.
The documentation of heartbeat said that for a symmetric cluster
it's the way to bind a resource to a node (or to a set of nodes).

The problem is that after the first pgsql resource is started successfully
on the first node, the second pgsql resource is checked to see whether
it's running on the first node - surprise, surprise, the system indicates
that it is. As a consequence, it's marked as startup failed and
heartbeat doesn't try to start it on the second node. Doing a cleanup
on the failed second pgsql resource makes it start but now the first
pgsql resource is marked failed. I guess because of the cleanup,
the second pgsql is thought to be running on node1 and is stopped.
The monitor action of the first resource notices that it's dead.
Catch 22?

Turning the configuration upside down (symmetric_cluster off
and using +INFINITY rsc_location scores for binding to the correct
node) didn't help.

How can I solve this besides using a different PGDATA directory
on the second node? The two machines are supposed to be configured
identically regarding PostgreSQL.


Attached is a patch for pgsql that supposedly fixes this issue. Please
test it and let me know the results.


I tested it. Unfortunately, it doesn't work as intended.
Below is the main chunk inlined for reference and explanation.

==============================
@@ -267,7 +278,14 @@
#

pgsql_status() {
-     pgrep -u  $OCF_RESKEY_pgdba "postmaster|postgres" >/dev/null 2>&1
+     if [ -r $PIDFILE ]
+     then
+         PID=`head -n 1 $PIDFILE`
+         kill -0 $PID 1>/dev/null 2>&1
+         return $?
+     fi
+
+     pgrep -u $OCF_RESKEY_pgdba -f "$OCF_RESKEY_status_pattern" >/dev/null 2>&1
}

#
==============================

The above does this:
1. If the PIDFILE is readable, extract the PID from the file
   and test whether a process with that PID exists.
2. Otherwise it searches the process list for a process that is owned by
   PGDBA and (supposedly) has the pattern in its name. That is not what
   happens, though: with -f the pattern is matched against the whole
   command line.
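
The difference between matching the process name and matching the whole
command line can be reproduced without PostgreSQL. The sketch below is an
illustration only: the marker string and the helper process are made up,
and it assumes a pgrep that supports -f (as the procps one does).

```shell
#!/bin/sh
# Arbitrary marker; it plays the role of "postgres" in "su - postgres".
marker="pgdemo$$"

# Helper process: its name is "sh"; the marker appears only as an argument.
# The trailing ":" keeps sh from replacing itself with sleep.
sh -c 'sleep 5; :' "$marker" >/dev/null 2>&1 &
helper=$!
sleep 1   # give the helper time to exec

# Match against the process name only (pgrep's default).
by_name=0
pgrep "$marker" >/dev/null 2>&1 || by_name=$?

# Match against the full command line, as the patched agent does with -f.
by_cmdline=0
pgrep -f "$marker" >/dev/null 2>&1 || by_cmdline=$?

kill "$helper" 2>/dev/null || true
echo "name match status: $by_name, command-line match status: $by_cmdline"
```

The name match should fail while the command-line match should succeed,
which is the same effect that lets "su - postgres" satisfy the
"postgres|postmaster" pattern.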

With your patch applied to the pgsql OCF RA in 2.1.3, I get this below.
Testing conditions:
- datadir1 is meant to be the PGDATA on the master
- datadir2 is meant to be the PGDATA on the slave
- no PostgreSQL instance running on the master
- there's a "su - postgres" shell open
- both commands are running on the master

# OCF_ROOT=/usr/lib/ocf OCF_RESKEY_pgport=2345 \
   OCF_RESKEY_pgdba=postgres \
   OCF_RESKEY_pgdata=datadir1 \
   OCF_RESKEY_pgctl=path-to-pg_ctl \
   OCF_RESKEY_psql=path-to-psql \
   OCF_RESKEY_status_pattern="postgres|postmaster" \
   ./pgsql monitor ; echo $?
2008/02/19_23:43:54 ERROR: PostgreSQL template1 isn't running
1

# OCF_ROOT=/usr/lib/ocf OCF_RESKEY_pgport=2345 \
   OCF_RESKEY_pgdba=postgres \
   OCF_RESKEY_pgdata=datadir2 \
   OCF_RESKEY_pgctl=path-to-pg_ctl \
   OCF_RESKEY_psql=path-to-psql \
   OCF_RESKEY_status_pattern="postgres|postmaster" \
   ./pgsql monitor ; echo $?
2008/02/19_23:43:54 ERROR: PostgreSQL template1 isn't running
1

Your patch didn't find PIDFILE but pgrep found a matching process!
Here's the proof:

# pgrep -u postgres -f "postgres|postmaster"
22798
# ps auxw | grep 22798
postgres 22798  0.0  0.1  23788  1300 pts/2    S    23:46   0:00 su - postgres
root     22860  0.0  0.1   3912   780 pts/5    S+   23:49   0:00 grep 22798

When I logged out of the "su - postgres" shell, with still no PostgreSQL
instance running, I got this response from both commands above:

2008/02/19_23:49:31 INFO: PostgreSQL is down
7

When I start pgsql with the OCF_RESKEY_pgdata=datadir1
setting on the master and run the "pgsql monitor" commands above,
it returns 0 for both pgdata=datadir1 and pgdata=datadir2.
So, it still doesn't work even though I have set the two nodes' PGDATA
differently. However, using pg_ctl in pgsql_status() solves the problem
just as I described below. Patch is attached this time, it's the same as
inlined below.
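
A rough sketch of why the pg_ctl approach tells the two data directories
apart: the status check is made per data directory rather than per process
name. The function name status_for_datadir is hypothetical, the sketch does
not switch to the database owner the way runasowner does in the real agent,
and it treats every non-zero exit from pg_ctl as "not running", which is an
assumption.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Hypothetical helper: report the state of the instance that owns one
# specific data directory.
# $1 = data directory to check
status_for_datadir() {
    # OCF_RESKEY_pgctl is the agent's path to pg_ctl; fall back to PATH.
    if "${OCF_RESKEY_pgctl:-pg_ctl}" -D "$1" status >/dev/null 2>&1; then
        return $OCF_SUCCESS
    fi
    # Assumption: any failure is reported as "not running".
    return $OCF_NOT_RUNNING
}
```

Called once with datadir1 and once with datadir2, this can return different
results on the same node, which the pgrep-based check cannot do.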

Best regards,
Zoltán Böszörményi

Thanks, but why not simply use pg_ctl status?

--- old_ocf/pgsql       2008-01-25 17:16:52.000000000 +0100
+++ pgsql       2008-02-11 06:15:28.000000000 +0100
@@ -267,7 +267,7 @@
#

pgsql_status() {
-     pgrep -u  $OCF_RESKEY_pgdba "postmaster|postgres" >/dev/null 2>&1
+     runasowner "$OCF_RESKEY_pgctl -D $OCF_RESKEY_pgdata status >/dev/null 2>&1"
}

#

This one above solved the problem for me.
But it requires that I use different PGDATA on different nodes.
It seems your patch isn't different in this regard. Your mods:
1. if $PIDFILE is readable, look at the PID it has
2. if not, look at the processes
It still has problems with identical PGDATA directories.
Do you have an idea how to distinguish in that case?

And a question about 2.1.3. After the upgrade, haclient couldn't connect
because mgmtd wasn't started. I needed to add these two lines to ha.cf:

apiauth         mgmtd   uid=root
respawn         root    /usr/lib64/heartbeat/mgmtd -v

Is it really needed? It wasn't for 2.0.8 and the docs say that
it's not necessary since 2.0.5. Documentation got outdated again,
or something broke?

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/


_______________________________________________

Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/

--- pgsql.orig	2008-02-19 23:23:17.000000000 +0100
+++ pgsql	2008-02-19 23:56:38.000000000 +0100
@@ -267,7 +267,7 @@
 #
 
 pgsql_status() {
-     pgrep -u  $OCF_RESKEY_pgdba "postmaster|postgres" >/dev/null 2>&1
+     runasowner "$OCF_RESKEY_pgctl -D $OCF_RESKEY_pgdata status >/dev/null 2>&1"
 }
 
 #
