Thanks, that looks like a good idea. I'll combine this patch with a couple of other changes and we'll submit it to the development tree.
On Feb 20, 2008 4:13 AM, Zoltan Boszormenyi <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Zoltan Boszormenyi wrote:
> > Serge Dubrouski wrote:
> >> On Feb 13, 2008 4:29 AM, Zoltan Boszormenyi <[EMAIL PROTECTED]> wrote:
> >>
> >>> Andrew Beekhof wrote:
> >>>
> >>>> On Feb 12, 2008, at 8:57 PM, Zoltan Boszormenyi wrote:
> >>>>
> >>>>> Andrew Beekhof wrote:
> >>>>>
> >>>>>> On Feb 12, 2008, at 4:59 PM, Zoltan Boszormenyi wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Serge Dubrouski wrote:
> >>>>>>>
> >>>>>>>> pgsql OCF RA doesn't support multistate configuration, so I don't
> >>>>>>>> think that creating a clone would be a good idea.
> >>>>>>>
> >>>>>>> Thanks for the information.
> >>>>>>>
> >>>>>>> Some other questions.
> >>>>>>>
> >>>>>>> According to http://linux-ha.org/v2/faq/resource_too_active
> >>>>>>> the monitor action should return 0 for running, 7 ($OCF_NOT_RUNNING)
> >>>>>>> for downed resources, and anything else for failed ones.
> >>>>>>> Either this documentation is buggy,
> >>>>>>
> >>>>>> no
> >>>>>>
> >>>>>>> or heartbeat doesn't conform to its own docs.
> >>>>>>
> >>>>>> also no
> >>>>>>
> >>>>>>> Here's the scenario: londiste creates a pidfile and deletes it
> >>>>>>> when it quits correctly. However, if I kill it manually then the
> >>>>>>> pidfile stays. What should my script return when it detects that
> >>>>>>> the process with the indicated PID is no longer there? It's not a
> >>>>>>> "downed" resource, it's a failed one. So I returned
> >>>>>>> $OCF_ERR_GENERIC. But after some time heartbeat says that my
> >>>>>>> resource became "unmanaged".
> >>>>>>
> >>>>>> I'm guessing (because you've not included anything on which to
> >>>>>> comment properly) that the stop action failed.
> >>>>>
> >>>>> It shouldn't have failed; the stop action always returns $OCF_SUCCESS.
> >>>>>
> >>>>>>> In contrast to this, the pgsql OCF RA does it differently.
> >>>>>>> It always returns 7 when it finds that there's no postmaster
> >>>>>>> process. Which is the right behaviour?
> >>>>>>
> >>>>>> It depends what you want to happen.
> >>>>>> If you want a stop to be sent, use OCF_ERR_GENERIC.
> >>>>>> If the resource is stateless and doesn't need any cleaning up, use
> >>>>>> OCF_NOT_RUNNING.
> >>>>>
> >>>>> That's quite an important detail. Shouldn't this be documented at
> >>>>> http://linux-ha.org/OCFResourceAgent ?
> >>>>
> >>>> Yep. But it's a wiki, so anyone can do that :)
> >>>
> >>> I see. It's an excuse because no one has done it yet. :-)
> >>>
> >>> Yesterday another problem popped up, and I don't understand why it
> >>> didn't happen before. I upgraded to heartbeat 2.1.3 using the SuSE
> >>> build service packages at
> >>> http://download.opensuse.org/repositories/server:/ha-clustering/
> >>> but the problem seems to persist. I have two pgsql resources,
> >>> using the stock install on my Fedora 6, i.e.
> >>> pgdata=/var/lib/pgsql/data.
> >>> Both are tied to their respective nodes, symmetric_cluster is on,
> >>> and the constraints' score is -INFINITY for running them on the
> >>> wrong node. The heartbeat documentation says that for a symmetric
> >>> cluster this is the way to bind a resource to a node (or to a set of
> >>> nodes). The problem is that after the first pgsql resource is
> >>> started successfully on the first node, the second pgsql resource is
> >>> checked for whether it's running on the first node - surprise,
> >>> surprise, the system indicates that it is. As a consequence, it's
> >>> marked as failed on startup, and heartbeat doesn't try to start it
> >>> on the second node. Doing a cleanup on the failed second pgsql
> >>> resource makes it start, but now the first pgsql resource is marked
> >>> failed. I guess that because of the cleanup, the second pgsql is
> >>> thought to be running on node1 and is stopped. The monitor action of
> >>> the first resource then notices that it's dead. Catch-22?
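[Editor's note: Andrew's rule of thumb above — OCF_ERR_GENERIC when a stop/cleanup is wanted, OCF_NOT_RUNNING when nothing is left to clean up — can be sketched as a minimal monitor action for the londiste-style pidfile case described earlier in the thread. This is an illustrative sketch, not the actual resource agent; the function name and pidfile handling are assumptions.]

```shell
#!/bin/sh
# OCF return codes as discussed in the thread.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical monitor action for a daemon that writes a pidfile on
# start and removes it on clean shutdown (like londiste above).
monitor_action() {
    pidfile="$1"

    # No pidfile at all: the daemon stopped cleanly.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING

    pid=$(head -n 1 "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS      # process is alive
    fi

    # Stale pidfile: the process was killed without cleaning up.
    # Returning OCF_ERR_GENERIC makes the CRM send a stop action,
    # which can then remove the stale pidfile.
    return $OCF_ERR_GENERIC
}
```

With this split, a manually killed daemon is reported as failed (stop gets sent and removes the stale pidfile), while a cleanly stopped one is reported as simply not running, matching the FAQ's three-way contract.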
> >>>
> >>> Turning the configuration upside down (symmetric_cluster off
> >>> and using +INFINITY rsc_location scores for binding to the correct
> >>> node) didn't help.
> >>>
> >>> How can I solve this besides using a different PGDATA directory
> >>> on the second node? The two machines are supposed to be configured
> >>> identically regarding PostgreSQL.
> >>
> >> Attached is a patch for pgsql that supposedly fixes this issue. Please
> >> test it and let me know the results.
>
> I tested it. Unfortunately, it doesn't work as intended.
> Below is the main chunk inlined for reference and explanation.
>
> ==============================
> @@ -267,7 +278,14 @@
>  #
>
>  pgsql_status() {
> -    pgrep -u $OCF_RESKEY_pgdba "postmaster|postgres" >/dev/null 2>&1
> +    if [ -r $PIDFILE ]
> +    then
> +        PID=`head -n 1 $PIDFILE`
> +        kill -0 $PID 1>/dev/null 2>&1
> +        return $?
> +    fi
> +
> +    pgrep -u $OCF_RESKEY_pgdba -f "$OCF_RESKEY_status_pattern" >/dev/null 2>&1
>  }
>
>  #
> ==============================
>
> The above does this:
> 1. If $PIDFILE is readable, extract the PID from the file
>    and test whether a process with that PID exists.
> 2. Otherwise it goes to the process list and looks for a process owned
>    by $OCF_RESKEY_pgdba that (supposedly) has the pattern in its name.
>    However, this is not correct: with -f, the pattern is matched against
>    the whole command line.
>
> With your patch applied to the pgsql OCF RA in 2.1.3, I get the results
> below. Testing conditions:
> - datadir1 is meant to be the PGDATA on the master
> - datadir2 is meant to be the PGDATA on the slave
> - no PostgreSQL instance is running on the master
> - there's a "su - postgres" shell open
> - both commands are run on the master
>
> # OCF_ROOT=/usr/lib/ocf OCF_RESKEY_pgport=2345 \
>     OCF_RESKEY_pgdba=postgres \
>     OCF_RESKEY_pgdata=datadir1 \
>     OCF_RESKEY_pgctl=path-to-pg_ctl \
>     OCF_RESKEY_psql=path-to-psql \
>     OCF_RESKEY_status_pattern="postgres|postmaster" \
>     ./pgsql monitor ; echo $?
> 2008/02/19_23:43:54 ERROR: PostgreSQL template1 isn't running
> 1
>
> # OCF_ROOT=/usr/lib/ocf OCF_RESKEY_pgport=2345 \
>     OCF_RESKEY_pgdba=postgres \
>     OCF_RESKEY_pgdata=datadir2 \
>     OCF_RESKEY_pgctl=path-to-pg_ctl \
>     OCF_RESKEY_psql=path-to-psql \
>     OCF_RESKEY_status_pattern="postgres|postmaster" \
>     ./pgsql monitor ; echo $?
> 2008/02/19_23:43:54 ERROR: PostgreSQL template1 isn't running
> 1
>
> Your patch didn't find $PIDFILE, but pgrep found a matching process!
> Here's the proof:
>
> # pgrep -u postgres -f "postgres|postmaster"
> 22798
> # ps auxw | grep 22798
> postgres 22798 0.0 0.1 23788 1300 pts/2 S  23:46 0:00 su - postgres
> root     22860 0.0 0.1  3912  780 pts/5 S+ 23:49 0:00 grep 22798
>
> When I log out from the "su - postgres" shell, with still no PostgreSQL
> instance running, I get this response from both commands above:
>
> 2008/02/19_23:49:31 INFO: PostgreSQL is down
> 7
>
> When I start pgsql with the OCF_RESKEY_pgdata=datadir1 setting on the
> master and run the "pgsql monitor" commands above, it returns 0 for
> both pgdata=datadir1 and pgdata=datadir2. So it still doesn't work,
> even though I have set the two nodes' PGDATA differently. However,
> using pg_ctl in pgsql_status() solves the problem, just as I described
> below. The patch is attached this time; it's the same as the one
> inlined below.
>
> Best regards,
> Zoltán Böszörményi
>
> >
> > Thanks, but why not simply use pg_ctl status?
> >
> > --- old_ocf/pgsql	2008-01-25 17:16:52.000000000 +0100
> > +++ pgsql	2008-02-11 06:15:28.000000000 +0100
> > @@ -267,7 +267,7 @@
> >  #
> >
> >  pgsql_status() {
> > -    pgrep -u $OCF_RESKEY_pgdba "postmaster|postgres" >/dev/null 2>&1
> > +    runasowner "$OCF_RESKEY_pgctl -D $OCF_RESKEY_pgdata status >/dev/null 2>&1"
> >  }
> >
> >  #
> >
> > This one above solved the problem for me.
> > But it requires that I use a different PGDATA on different nodes.
> > It seems your patch isn't different in this regard. Your mods:
> > 1. If $PIDFILE is readable, look at the PID it contains.
> > 2. If not, look at the processes.
> > It still has problems with identical PGDATA directories.
> > Do you have an idea how to distinguish in that case?
> >
> >>
> >>> And a question about 2.1.3. After the upgrade, haclient couldn't
> >>> connect because mgmtd wasn't started. I needed to add these two
> >>> lines to ha.cf:
> >>>
> >>> apiauth mgmtd uid=root
> >>> respawn root /usr/lib64/heartbeat/mgmtd -v
> >>>
> >>> Is this really needed? It wasn't for 2.0.8, and the docs say that it
> >>> hasn't been necessary since 2.0.5. Has the documentation gone out of
> >>> date again, or did something break?
>
> --
> ----------------------------------
> Zoltán Böszörményi
> Cybertec Schönig & Schönig GmbH
> http://www.postgresql.at/

--
Serge Dubrouski.

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
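[Editor's note: the pg_ctl approach works because "pg_ctl -D $PGDATA status" checks the postmaster.pid file inside the given data directory (its first line is the postmaster's PID), so the test is scoped to one data directory rather than to any process owned by the postgres user on the host. A rough, hypothetical shell equivalent of that scoping — not the actual RA code, and simpler than what pg_ctl itself does — looks like this:]

```shell
#!/bin/sh
# Hypothetical data-directory-scoped status check, approximating what
# "pg_ctl -D $PGDATA status" does internally: PostgreSQL writes
# $PGDATA/postmaster.pid on startup, and its first line is the
# postmaster's PID.
pgsql_status() {
    pgdata="$1"
    pidfile="$pgdata/postmaster.pid"

    # No pidfile in this data directory: this instance is not running.
    [ -r "$pidfile" ] || return 1

    pid=$(head -n 1 "$pidfile")
    kill -0 "$pid" 2>/dev/null   # 0 if that postmaster is alive
}
```

Because the check is keyed to the pidfile inside one data directory, an unrelated process owned by postgres (such as the "su - postgres" shell that fooled pgrep above) can no longer produce a false positive. It does not, however, help when two resources point at the same PGDATA path, which is the remaining ambiguity Zoltán asks about.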
