Re: [Linux-ha-dev] postfix ocf ra

Dejan Muhamedagic Fri, 29 May 2009 02:58:20 -0700

Hi Lars,

On Thu, May 28, 2009 at 10:43:37PM +0200, Lars Ellenberg wrote:
> On Thu, May 28, 2009 at 08:33:15PM +0200, Raoul Bhatia [IPAX] wrote:
> 
> the first part snipped here has just been skipped by me.
> 
> > isRunning()
> > {
> >     kill -0 "$1" 2>/dev/null
> > }
> 
> 
> what is that supposed to prove?
> it is only used after you checked for /proc/$pid/exe.
> so $pid is an existing process, otherwise the (linux specific?) /proc/$pid/ 
> would not exist.
> 
> whether or not it is "running" (and not STOPPED or ZOMBIE),
> kill -0 won't tell you either.


Right. Though in case of STOPPED we may still consider the
resource running, since it may CONTinue at any time.

> > # running() has been copied from debian's init script. we enhanced it a bit
> > running() {
> >     queue=$(postconf $OPTION_CONFIG_DIR -h queue_directory 2>/dev/null || 
> > echo /var/spool/postfix)
> 
> if postconf -h queue_directory does not work, this is a broken
> installation and should IMO not provide any other "default" value.
> 
> >     if [ -f ${queue}/pid/master.pid ]; then
> 
> just because a pid file does not exist, it is not running?
> that is not exactly true.

Well spotted.

> >         pid=$(sed 's/ //g' ${queue}/pid/master.pid)
> 
> if you just use $pid (not "$pid"), shell did the whitespace stripping
> for you, so this is unnecessary.
> if you trust the input from that file (I think you should), just use $pid.
> if you don't, you have an other problem a cluster manager will not
> protect you against.
> 
> >         # what directory does the executable live in.  stupid prelink 
> > systems.
> >         dir=$(ls -l /proc/$pid/exe 2>/dev/null | sed 's/.* -> //; 
> > s/\/[^\/]*$//')
> >         if [ "X$dir" = "X/usr/lib/postfix" ]; then
> 
> what is that supposed to do?
> what does it guard against?
> stale pidfile pointing to other process with recycled pid?
> I don't like that.

That part flies anyway.

[...]

> > postfix_status()
> > {
> >     running
> > }
> 
> maybe just do "postqueue -q 2>&1 | head -n1" and see if it reads
> "postqueue: warning: Mail system is down -- accessing queue directly"
> ?

This sounds like a very good idea.

> > postfix_start()
> > {
> >     # if Postfix is running return success
> >     if postfix_status; then
> >         ocf_log info "Postfix already running."
> >         exit $OCF_SUCCESS
> >     fi
> > 
> >     # start Postfix
> >     $binary $OPTIONS start >/dev/null 2>&1
> >     ret=$?
> >     if [ $ret -ne 0 ]; then
> >         ocf_log err "Postfix returned error." $ret
> >         exit $OCF_ERR_GENERIC
> >     fi
> > 
> >     exit $OCF_SUCCESS
> > }
> > 
> > 
> > postfix_stop()
> > {
> >     if postfix_status; then
> 
> nope.
> just stop it.
> always.
> do not do status first.
> 
> because, crm (or is it only lrm?) will do that for you anyways.
> 
> and because postfix stop is idempotent anyways.
> 
> as you have it now,
> if someone does rm master.pid while postfix is running,
> postfix_status says false,
> monitor will catch a failure, may attempt a restart on this node
> so start is attempted, which fails,
>
> triggeres recovery action in crm for failed start, which -- correct me
> if I'm wrong -- would try to stop it here, just in case, which would be
> mapped to no-op because of the missing pidfile.
> 
> occasionally it would do a monitor again, which would tell it "not running",
> and because of the failed start (or failed monitor) on this node, it is
> then started on an other node. depending on whether or not it is bound
> to a specific IP, and whether or not that IP is controlled by CRM, this
> may result in multiple instances running.
> which, again depending on the configuration, may be a very bad thing,
> or completely harmless.

True, but the monitor action must be fixed to always tell the
truth. No harm in doing monitor before stop. If the RA logs the
case when the resource has already been stopped, perhaps that can
help sometimes with troubleshooting.

> >         $binary $OPTIONS stop >/dev/null 2>&1
> >         ret=$?
> > 
> >         if [ $ret -ne 0 ]; then
> >             ocf_log err "Postfix returned an error while stopping." $ret
> >             exit $OCF_ERR_GENERIC
> >         fi
> > 
> >         # grant some time for shutdown and recheck
> >         sleep 1
> >         if postfix_status; then
> >             ocf_log err "Postfix failed to stop."
> >             exit $OCF_ERR_GENERIC
> >         fi
> 
> if that should be necessary, you could escalate to "postfix abort", btw.

This is also good point. I think that at times the spool
directory may reside on an NFS. I wonder if "abort" would work in
that case too. Forgot to mention that it could be good thing to,
as often it is done with start, to put monitor in a loop and wait
for the stop to timeout in case there's a bit of housekeeping to
be done.

> > case $1 in
> >     monitor)    postfix_monitor
> >                 exit $?
> >                 ;;
> >     start)      postfix_start
> 
> inconsistent.
> why hide the exit in the sub function for some functions, but not others?

Probably because monitor/status is used elsewhere.

[...]

> Still with me?

Hope so :)

> Thanks for the effort!

Cheers,

Dejan

> 
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD? and LINBIT? are registered trademarks of LINBIT, Austria.
> _______________________________________________________
> Linux-HA-Dev: [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Re: [Linux-ha-dev] postfix ocf ra

Reply via email to