Thanks, that looks like a good idea. I'll combine this patch with a couple of other changes and we'll submit it to the development tree.
On Feb 20, 2008 4:13 AM, Zoltan Boszormenyi <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Zoltan Boszormenyi wrote:
> > Serge Dubrouski wrote:
> >> On Feb 13, 2008 4:29 AM, Zoltan Boszormenyi <[EMAIL PROTECTED]> wrote:
> >>
> >>> Andrew Beekhof wrote:
> >>>
> >>>> On Feb 12, 2008, at 8:57 PM, Zoltan Boszormenyi wrote:
> >>>>
> >>>>> Andrew Beekhof wrote:
> >>>>>
> >>>>>> On Feb 12, 2008, at 4:59 PM, Zoltan Boszormenyi wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Serge Dubrouski wrote:
> >>>>>>>
> >>>>>>>> pgsql OCF RA doesn't support multistate configuration, so I don't
> >>>>>>>> think that creating a clone would be a good idea.
> >>>>>>>
> >>>>>>> Thanks for the information.
> >>>>>>>
> >>>>>>> Some other questions.
> >>>>>>>
> >>>>>>> According to http://linux-ha.org/v2/faq/resource_too_active
> >>>>>>> the monitor action should return 0 for running, 7 ($OCF_NOT_RUNNING)
> >>>>>>> for downed resources, and anything else for failed ones.
> >>>>>>> Either this documentation is buggy,
> >>>>>>
> >>>>>> no
> >>>>>>
> >>>>>>> or heartbeat doesn't conform to its own docs.
> >>>>>>
> >>>>>> also no
> >>>>>>
> >>>>>>> Here's the scenario: londiste creates a pidfile and deletes it
> >>>>>>> when it quits correctly. However, if I kill it manually then the
> >>>>>>> pidfile stays. What should my script return when it detects that
> >>>>>>> the process with the indicated PID is no longer there? It's not a
> >>>>>>> "downed" resource, it's a failed one. So I returned
> >>>>>>> $OCF_ERR_GENERIC. But after some time heartbeat says that my
> >>>>>>> resource became "unmanaged".
> >>>>>>
> >>>>>> I'm guessing (because you've not included anything on which to
> >>>>>> comment properly) that the stop action failed.
> >>>>>
> >>>>> It shouldn't have failed; the stop action always returns $OCF_SUCCESS.
> >>>>>
> >>>>>>> In contrast to this, the pgsql OCF RA does it differently.
> >>>>>>> It always returns 7 when it finds that there's no postmaster
> >>>>>>> process. Which is the right behaviour?
> >>>>>>
> >>>>>> It depends what you want to happen.
> >>>>>> If you want a stop to be sent, use OCF_ERR_GENERIC.
> >>>>>> If the resource is stateless and doesn't need any cleaning up, use
> >>>>>> OCF_NOT_RUNNING.
> >>>>>
> >>>>> That's quite an important detail. Shouldn't this be documented at
> >>>>> http://linux-ha.org/OCFResourceAgent ?
> >>>>
> >>>> Yep. But it's a wiki, so anyone can do that :)
> >>>
> >>> I see. It's an excuse because no one has done it yet. :-)
> >>>
> >>> Yesterday another problem popped up, and I don't understand why it
> >>> didn't happen before. I upgraded to heartbeat 2.1.3 using the SuSE
> >>> build service packages at
> >>> http://download.opensuse.org/repositories/server:/ha-clustering/
> >>> but the problem seems to persist. I have two pgsql resources,
> >>> using the stock install on my Fedora 6, i.e.
> >>> pgdata=/var/lib/pgsql/data.
> >>> Both are tied to their respective nodes, symmetric_cluster is on,
> >>> and the constraints' score is -INFINITY for running them on the
> >>> wrong node. The heartbeat documentation says that for a symmetric
> >>> cluster this is the way to bind a resource to a node (or to a set of
> >>> nodes). The problem is that after the first pgsql resource is
> >>> started successfully on the first node, the second pgsql resource is
> >>> checked for whether it's running on the first node - surprise,
> >>> surprise, the system indicates that it is. As a consequence, it's
> >>> marked as failed on startup, and heartbeat doesn't try to start it
> >>> on the second node. Doing a cleanup on the failed second pgsql
> >>> resource makes it start, but now the first pgsql resource is marked
> >>> failed. I guess that because of the cleanup, the second pgsql is
> >>> thought to be running on node1 and is stopped. The monitor action of
> >>> the first resource then notices that it's dead. Catch-22?
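[Editor's note: Andrew's rule of thumb above — OCF_ERR_GENERIC when a stop/cleanup is wanted, OCF_NOT_RUNNING when nothing is left to clean up — can be sketched as a minimal monitor action for the londiste-style pidfile case described earlier in the thread. This is an illustrative sketch, not the actual resource agent; the function name and pidfile handling are assumptions.]

```shell
#!/bin/sh
# OCF return codes as discussed in the thread.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical monitor action for a daemon that writes a pidfile on
# start and removes it on clean shutdown (like londiste above).
monitor_action() {
    pidfile="$1"

    # No pidfile at all: the daemon stopped cleanly.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING

    pid=$(head -n 1 "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS      # process is alive
    fi

    # Stale pidfile: the process was killed without cleaning up.
    # Returning OCF_ERR_GENERIC makes the CRM send a stop action,
    # which can then remove the stale pidfile.
    return $OCF_ERR_GENERIC
}
```

With this split, a manually killed daemon is reported as failed (stop gets sent and removes the stale pidfile), while a cleanly stopped one is reported as simply not running, matching the FAQ's three-way contract.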
> >>>
> >>> Turning the configuration upside down (symmetric_cluster off
> >>> and using +INFINITY rsc_location scores for binding to the correct
> >>> node) didn't help.
> >>>
> >>> How can I solve this besides using a different PGDATA directory
> >>> on the second node? The two machines are supposed to be configured
> >>> identically regarding PostgreSQL.
> >>
> >> Attached is a patch for pgsql that supposedly fixes this issue. Please
> >> test it and let me know the results.
>
> I tested it. Unfortunately, it doesn't work as intended.
> Below is the main chunk inlined for reference and explanation.
>
> ==============================
> @@ -267,7 +278,14 @@
>  #
>
>  pgsql_status() {
> -    pgrep -u $OCF_RESKEY_pgdba "postmaster|postgres" >/dev/null 2>&1
> +    if [ -r $PIDFILE ]
> +    then
> +        PID=`head -n 1 $PIDFILE`
> +        kill -0 $PID 1>/dev/null 2>&1
> +        return $?
> +    fi
> +
> +    pgrep -u $OCF_RESKEY_pgdba -f "$OCF_RESKEY_status_pattern" >/dev/null 2>&1
>  }
>
>  #
> ==============================
>
> The above does this:
> 1. If $PIDFILE is readable, extract the PID from the file
>    and test whether a process with that PID exists.
> 2. Otherwise it goes to the process list and looks for a process owned
>    by $OCF_RESKEY_pgdba that (supposedly) has the pattern in its name.
>    However, this is not correct: with -f, the pattern is matched against
>    the whole command line.
>
> With your patch applied to the pgsql OCF RA in 2.1.3, I get the results
> below. Testing conditions:
> - datadir1 is meant to be the PGDATA on the master
> - datadir2 is meant to be the PGDATA on the slave
> - no PostgreSQL instance is running on the master
> - there's a "su - postgres" shell open
> - both commands are run on the master
>
> # OCF_ROOT=/usr/lib/ocf OCF_RESKEY_pgport=2345 \
>     OCF_RESKEY_pgdba=postgres \
>     OCF_RESKEY_pgdata=datadir1 \
>     OCF_RESKEY_pgctl=path-to-pg_ctl \
>     OCF_RESKEY_psql=path-to-psql \
>     OCF_RESKEY_status_pattern="postgres|postmaster" \
>     ./pgsql monitor ; echo $?
> 2008/02/19_23:43:54 ERROR: PostgreSQL template1 isn't running
> 1
>
> # OCF_ROOT=/usr/lib/ocf OCF_RESKEY_pgport=2345 \
>     OCF_RESKEY_pgdba=postgres \
>     OCF_RESKEY_pgdata=datadir2 \
>     OCF_RESKEY_pgctl=path-to-pg_ctl \
>     OCF_RESKEY_psql=path-to-psql \
>     OCF_RESKEY_status_pattern="postgres|postmaster" \
>     ./pgsql monitor ; echo $?
> 2008/02/19_23:43:54 ERROR: PostgreSQL template1 isn't running
> 1
>
> Your patch didn't find $PIDFILE, but pgrep found a matching process!
> Here's the proof:
>
> # pgrep -u postgres -f "postgres|postmaster"
> 22798
> # ps auxw | grep 22798
> postgres 22798 0.0 0.1 23788 1300 pts/2 S  23:46 0:00 su - postgres
> root     22860 0.0 0.1  3912  780 pts/5 S+ 23:49 0:00 grep 22798
>
> When I log out from the "su - postgres" shell, with still no PostgreSQL
> instance running, I get this response from both commands above:
>
> 2008/02/19_23:49:31 INFO: PostgreSQL is down
> 7
>
> When I start pgsql with the OCF_RESKEY_pgdata=datadir1 setting on the
> master and run the "pgsql monitor" commands above, it returns 0 for
> both pgdata=datadir1 and pgdata=datadir2. So it still doesn't work,
> even though I have set the two nodes' PGDATA differently. However,
> using pg_ctl in pgsql_status() solves the problem, just as I described
> below. The patch is attached this time; it's the same as the one
> inlined below.
>
> Best regards,
> Zoltán Böszörményi
>
> >
> > Thanks, but why not simply use pg_ctl status?
> >
> > --- old_ocf/pgsql	2008-01-25 17:16:52.000000000 +0100
> > +++ pgsql	2008-02-11 06:15:28.000000000 +0100
> > @@ -267,7 +267,7 @@
> >  #
> >
> >  pgsql_status() {
> > -    pgrep -u $OCF_RESKEY_pgdba "postmaster|postgres" >/dev/null 2>&1
> > +    runasowner "$OCF_RESKEY_pgctl -D $OCF_RESKEY_pgdata status >/dev/null 2>&1"
> >  }
> >
> >  #
> >
> > This one above solved the problem for me.
> > But it requires that I use a different PGDATA on different nodes.
> > It seems your patch isn't different in this regard. Your mods:
> > 1. If $PIDFILE is readable, look at the PID it contains.
> > 2. If not, look at the processes.
> > It still has problems with identical PGDATA directories.
> > Do you have an idea how to distinguish in that case?
> >
> >>
> >>> And a question about 2.1.3. After the upgrade, haclient couldn't
> >>> connect because mgmtd wasn't started. I needed to add these two
> >>> lines to ha.cf:
> >>>
> >>> apiauth mgmtd uid=root
> >>> respawn root /usr/lib64/heartbeat/mgmtd -v
> >>>
> >>> Is this really needed? It wasn't for 2.0.8, and the docs say that it
> >>> hasn't been necessary since 2.0.5. Has the documentation gone out of
> >>> date again, or did something break?
>
> --
> ----------------------------------
> Zoltán Böszörményi
> Cybertec Schönig & Schönig GmbH
> http://www.postgresql.at/

--
Serge Dubrouski.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
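[Editor's note: the pg_ctl approach works because "pg_ctl -D $PGDATA status" checks the postmaster.pid file inside the given data directory (its first line is the postmaster's PID), so the test is scoped to one data directory rather than to any process owned by the postgres user on the host. A rough, hypothetical shell equivalent of that scoping — not the actual RA code, and simpler than what pg_ctl itself does — looks like this:]

```shell
#!/bin/sh
# Hypothetical data-directory-scoped status check, approximating what
# "pg_ctl -D $PGDATA status" does internally: PostgreSQL writes
# $PGDATA/postmaster.pid on startup, and its first line is the
# postmaster's PID.
pgsql_status() {
    pgdata="$1"
    pidfile="$pgdata/postmaster.pid"

    # No pidfile in this data directory: this instance is not running.
    [ -r "$pidfile" ] || return 1

    pid=$(head -n 1 "$pidfile")
    kill -0 "$pid" 2>/dev/null   # 0 if that postmaster is alive
}
```

Because the check is keyed to the pidfile inside one data directory, an unrelated process owned by postgres (such as the "su - postgres" shell that fooled pgrep above) can no longer produce a false positive. It does not, however, help when two resources point at the same PGDATA path, which is the remaining ambiguity Zoltán asks about.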
