hi,
i'm reworking my script right now. commenting inline.
Dejan Muhamedagic wrote:
>> the first part snipped here has just been skipped by me.
>>
>>> isRunning()
>>> {
>>> kill -0 "$1" 2>/dev/null
>>> }
>>
>> what is that supposed to prove?
>> it is only used after you checked for /proc/$pid/exe.
>> so $pid is an existing process, otherwise the (linux specific?) /proc/$pid/
>> would not exist.
>>
>> whether or not it is "running" (and not STOPPED or ZOMBIE),
>> kill -0 won't tell you either.
>
> Right. Though in case of STOPPED we may still consider the
> resource running, since it may CONTinue at any time.
i used this as kill -0/isRunning() is used in several other ocf RAs.
i'm leaving this for now.
>>> # running() has been copied from debian's init script. we enhanced it a bit
>>> running() {
>>> queue=$(postconf $OPTION_CONFIG_DIR -h queue_directory 2>/dev/null ||
>>> echo /var/spool/postfix)
>> if postconf -h queue_directory does not work, this is a broken
>> installation and should IMO not provide any other "default" value.
i c/p this from the debian init script. what shall we do in this case?
>>> if [ -f ${queue}/pid/master.pid ]; then
>> just because a pid file does not exist, it is not running?
>> that is not exactly true.
>
> Well spotted.
any suggestions on what to do? i think it is also not valid to
check for running master/qmgr/etc. processes as there might be
several postfix instances on the same server (think one local
postfix for e.g. cron and one clustered one)
>>> pid=$(sed 's/ //g' ${queue}/pid/master.pid)
>> if you just use $pid (not "$pid"), shell did the whitespace stripping
>> for you, so this is unnecessary.
>> if you trust the input from that file (I think you should), just use $pid.
>> if you don't, you have an other problem a cluster manager will not
>> protect you against.
point taken, thank you!
>>> # what directory does the executable live in. stupid prelink
>>> systems.
>>> dir=$(ls -l /proc/$pid/exe 2>/dev/null | sed 's/.* -> //;
>>> s/\/[^\/]*$//')
>>> if [ "X$dir" = "X/usr/lib/postfix" ]; then
>> what is that supposed to do?
>> what does it guard against?
>> stale pidfile pointing to other process with recycled pid?
>> I don't like that.
>
> That part flies anyway.
this has been removed.
>>> postfix_status()
>>> {
>>> running
>>> }
>> maybe just do "postqueue -q 2>&1 | head -n1" and see if it reads
>> "postqueue: warning: Mail system is down -- accessing queue directly"
>> ?
>
> This sounds like a very good idea.
mhm - shall the complete running() function be replaced then?
or shall this be an additional check?
i just (think i) verified that "rm master.pid" does not stop
postqueue -p from working.
>>> postfix_stop()
>>> {
>>> if postfix_status; then
>> nope.
>> just stop it.
>> always.
>> do not do status first.
>>
>> because, crm (or is it only lrm?) will do that for you anyways.
>>
>> and because postfix stop is idempotent anyways.
>>
>> as you have it now,
>> if someone does rm master.pid while postfix is running,
>> postfix_status says false,
>> monitor will catch a failure, may attempt a restart on this node
>> so start is attempted, which fails,
>>
>> triggeres recovery action in crm for failed start, which -- correct me
>> if I'm wrong -- would try to stop it here, just in case, which would be
>> mapped to no-op because of the missing pidfile.
>>
>> occasionally it would do a monitor again, which would tell it "not running",
>> and because of the failed start (or failed monitor) on this node, it is
>> then started on an other node. depending on whether or not it is bound
>> to a specific IP, and whether or not that IP is controlled by CRM, this
>> may result in multiple instances running.
>> which, again depending on the configuration, may be a very bad thing,
>> or completely harmless.
>
> True, but the monitor action must be fixed to always tell the
> truth. No harm in doing monitor before stop. If the RA logs the
> case when the resource has already been stopped, perhaps that can
> help sometimes with troubleshooting.
so, should it stay or should it go?
>>> $binary $OPTIONS stop >/dev/null 2>&1
>>> ret=$?
>>>
>>> if [ $ret -ne 0 ]; then
>>> ocf_log err "Postfix returned an error while stopping." $ret
>>> exit $OCF_ERR_GENERIC
>>> fi
>>>
>>> # grant some time for shutdown and recheck
>>> sleep 1
>>> if postfix_status; then
>>> ocf_log err "Postfix failed to stop."
>>> exit $OCF_ERR_GENERIC
>>> fi
>> if that should be necessary, you could escalate to "postfix abort", btw.
>
> This is also good point. I think that at times the spool
> directory may reside on an NFS. I wonder if "abort" would work in
> that case too. Forgot to mention that it could be good thing to,
> as often it is done with start, to put monitor in a loop and wait
> for the stop to timeout in case there's a bit of housekeeping to
> be done.
a lot of scripts limit the retries to say 20 times so that we're
not stuck in an infinite loop. i would prefer this method.
anyways, i reworked this part and would like your feedback.
please wait until i finished my internal tests though.
>>> case $1 in
>>> monitor) postfix_monitor
>>> exit $?
>>> ;;
>>> start) postfix_start
>> inconsistent.
>> why hide the exit in the sub function for some functions, but not others?
>
> Probably because monitor/status is used elsewhere.
yes. i am reworking this part though
>> Thanks for the effort!
thank you too!
cheers,
raoul
--
____________________________________________________________________
DI (FH) Raoul Bhatia M.Sc. email. [email protected]
Technischer Leiter
IPAX - Aloy Bhatia Hava OEG web. http://www.ipax.at
Barawitzkagasse 10/2/2/11 email. [email protected]
1190 Wien tel. +43 1 3670030
FN 277995t HG Wien fax. +43 1 3670030 15
____________________________________________________________________
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/