Re: [Linux-ha-dev] postfix ocf ra

Raoul Bhatia [IPAX] Tue, 23 Jun 2009 05:13:29 -0700

hi,

i'm reworking my script right now. commenting inline.


Dejan Muhamedagic wrote:
>> the first part snipped here has just been skipped by me.
>>
>>> isRunning()
>>> {
>>>     kill -0 "$1" 2>/dev/null
>>> }
>>
>> what is that supposed to prove?
>> it is only used after you checked for /proc/$pid/exe.
>> so $pid is an existing process, otherwise the (linux specific?) /proc/$pid/ 
>> would not exist.
>>
>> whether or not it is "running" (and not STOPPED or ZOMBIE),
>> kill -0 won't tell you either.
> 
> Right. Though in case of STOPPED we may still consider the
> resource running, since it may CONTinue at any time.

i used this as kill -0/isRunning() is used in several other ocf RAs.
i'm leaving this for now.

>>> # running() has been copied from debian's init script. we enhanced it a bit
>>> running() {
>>>     queue=$(postconf $OPTION_CONFIG_DIR -h queue_directory 2>/dev/null || 
>>> echo /var/spool/postfix)
>> if postconf -h queue_directory does not work, this is a broken
>> installation and should IMO not provide any other "default" value.

i c/p this from the debian init script. what shall we do in this case?

>>>     if [ -f ${queue}/pid/master.pid ]; then
>> just because a pid file does not exist, it is not running?
>> that is not exactly true.
> 
> Well spotted.

any suggestions on what to do? i think it is also not valid to
check for running master/qmgr/etc. processes as there might be
several postfix instances on the same server (think one local
postfix for e.g. cron and one clustered one)

>>>         pid=$(sed 's/ //g' ${queue}/pid/master.pid)
>> if you just use $pid (not "$pid"), shell did the whitespace stripping
>> for you, so this is unnecessary.
>> if you trust the input from that file (I think you should), just use $pid.
>> if you don't, you have an other problem a cluster manager will not
>> protect you against.

point taken, thank you!

>>>         # what directory does the executable live in.  stupid prelink 
>>> systems.
>>>         dir=$(ls -l /proc/$pid/exe 2>/dev/null | sed 's/.* -> //; 
>>> s/\/[^\/]*$//')
>>>         if [ "X$dir" = "X/usr/lib/postfix" ]; then
>> what is that supposed to do?
>> what does it guard against?
>> stale pidfile pointing to other process with recycled pid?
>> I don't like that.
> 
> That part flies anyway.

this has been removed.

>>> postfix_status()
>>> {
>>>     running
>>> }
>> maybe just do "postqueue -q 2>&1 | head -n1" and see if it reads
>> "postqueue: warning: Mail system is down -- accessing queue directly"
>> ?
> 
> This sounds like a very good idea.

mhm - shall the complete running() function be replaced then?
or shall this be an additional check?

i just (think i) verified that "rm master.pid" does not stop
postqueue -p from working.

>>> postfix_stop()
>>> {
>>>     if postfix_status; then
>> nope.
>> just stop it.
>> always.
>> do not do status first.
>>
>> because, crm (or is it only lrm?) will do that for you anyways.
>>
>> and because postfix stop is idempotent anyways.
>>
>> as you have it now,
>> if someone does rm master.pid while postfix is running,
>> postfix_status says false,
>> monitor will catch a failure, may attempt a restart on this node
>> so start is attempted, which fails,
>>
>> triggeres recovery action in crm for failed start, which -- correct me
>> if I'm wrong -- would try to stop it here, just in case, which would be
>> mapped to no-op because of the missing pidfile.
>>
>> occasionally it would do a monitor again, which would tell it "not running",
>> and because of the failed start (or failed monitor) on this node, it is
>> then started on an other node. depending on whether or not it is bound
>> to a specific IP, and whether or not that IP is controlled by CRM, this
>> may result in multiple instances running.
>> which, again depending on the configuration, may be a very bad thing,
>> or completely harmless.
> 
> True, but the monitor action must be fixed to always tell the
> truth. No harm in doing monitor before stop. If the RA logs the
> case when the resource has already been stopped, perhaps that can
> help sometimes with troubleshooting.

so, should it stay or should it go?

>>>         $binary $OPTIONS stop >/dev/null 2>&1
>>>         ret=$?
>>>
>>>         if [ $ret -ne 0 ]; then
>>>             ocf_log err "Postfix returned an error while stopping." $ret
>>>             exit $OCF_ERR_GENERIC
>>>         fi
>>>
>>>         # grant some time for shutdown and recheck
>>>         sleep 1
>>>         if postfix_status; then
>>>             ocf_log err "Postfix failed to stop."
>>>             exit $OCF_ERR_GENERIC
>>>         fi
>> if that should be necessary, you could escalate to "postfix abort", btw.
> 
> This is also good point. I think that at times the spool
> directory may reside on an NFS. I wonder if "abort" would work in
> that case too. Forgot to mention that it could be good thing to,
> as often it is done with start, to put monitor in a loop and wait
> for the stop to timeout in case there's a bit of housekeeping to
> be done.

a lot of scripts limit the retries to say 20 times so that we're
not stuck in an infinite loop. i would prefer this method.

anyways, i reworked this part and would like your feedback.
please wait until i finished my internal tests though.

>>> case $1 in
>>>     monitor)    postfix_monitor
>>>                 exit $?
>>>                 ;;
>>>     start)      postfix_start
>> inconsistent.
>> why hide the exit in the sub function for some functions, but not others?
> 
> Probably because monitor/status is used elsewhere.

yes. i am reworking this part though

>> Thanks for the effort!

thank you too!

cheers,
raoul
-- 
____________________________________________________________________
DI (FH) Raoul Bhatia M.Sc.          email.          [email protected]
Technischer Leiter

IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
Barawitzkagasse 10/2/2/11           email.            [email protected]
1190 Wien                           tel.               +43 1 3670030
FN 277995t HG Wien                  fax.            +43 1 3670030 15
____________________________________________________________________
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Re: [Linux-ha-dev] postfix ocf ra

Reply via email to