Re: [Linux-ha-dev] postfix ocf ra

Lars Ellenberg Thu, 28 May 2009 13:43:48 -0700

On Thu, May 28, 2009 at 08:33:15PM +0200, Raoul Bhatia [IPAX] wrote:

the first part snipped here has just been skipped by me.


> isRunning()
> {
>     kill -0 "$1" 2>/dev/null
> }


what is that supposed to prove?
it is only used after you checked for /proc/$pid/exe.
so $pid is an existing process, otherwise the (linux specific?) /proc/$pid/ 
would not exist.

whether or not it is "running" (and not STOPPED or ZOMBIE),
kill -0 won't tell you either.

> # running() has been copied from debian's init script. we enhanced it a bit
> running() {
>     queue=$(postconf $OPTION_CONFIG_DIR -h queue_directory 2>/dev/null || 
> echo /var/spool/postfix)

if postconf -h queue_directory does not work, this is a broken
installation and should IMO not provide any other "default" value.

>     if [ -f ${queue}/pid/master.pid ]; then

just because a pid file does not exist, it is not running?
that is not exactly true.

>         pid=$(sed 's/ //g' ${queue}/pid/master.pid)

if you just use $pid (not "$pid"), shell did the whitespace stripping
for you, so this is unnecessary.
if you trust the input from that file (I think you should), just use $pid.
if you don't, you have an other problem a cluster manager will not
protect you against.

>         # what directory does the executable live in.  stupid prelink systems.
>         dir=$(ls -l /proc/$pid/exe 2>/dev/null | sed 's/.* -> //; 
> s/\/[^\/]*$//')
>         if [ "X$dir" = "X/usr/lib/postfix" ]; then

what is that supposed to do?
what does it guard against?
stale pidfile pointing to other process with recycled pid?
I don't like that.

and it is not even correct, as you can reconfigure "daemon_directory",
so if you want to do that, you need to postconf -h daemon_directory
first.

>             if isRunning $pid; then

see above, kill -0 not necessary, as you just used /proc/pid/ !

>                 # @TODO why does "true" not work here?

because if you do not return here, you reach the "false" below,
which overrides this true ;)

>                 #true
>                 return 0
>             fi
>         fi
>     fi
> 
>     # Postfix is not running
>     false
> }
> 
> 
> postfix_status()
> {
>     running
> }

maybe just do "postqueue -q 2>&1 | head -n1" and see if it reads
"postqueue: warning: Mail system is down -- accessing queue directly"
?

> postfix_start()
> {
>     # if Postfix is running return success
>     if postfix_status; then
>         ocf_log info "Postfix already running."
>         exit $OCF_SUCCESS
>     fi
> 
>     # start Postfix
>     $binary $OPTIONS start >/dev/null 2>&1
>     ret=$?
>     if [ $ret -ne 0 ]; then
>         ocf_log err "Postfix returned error." $ret
>         exit $OCF_ERR_GENERIC
>     fi
> 
>     exit $OCF_SUCCESS
> }
> 
> 
> postfix_stop()
> {
>     if postfix_status; then

nope.
just stop it.
always.
do not do status first.

because, crm (or is it only lrm?) will do that for you anyways.

and because postfix stop is idempotent anyways.

as you have it now,
if someone does rm master.pid while postfix is running,
postfix_status says false,
monitor will catch a failure, may attempt a restart on this node
so start is attempted, which fails,
triggeres recovery action in crm for failed start, which -- correct me
if I'm wrong -- would try to stop it here, just in case, which would be
mapped to no-op because of the missing pidfile.

occasionally it would do a monitor again, which would tell it "not running",
and because of the failed start (or failed monitor) on this node, it is
then started on an other node. depending on whether or not it is bound
to a specific IP, and whether or not that IP is controlled by CRM, this
may result in multiple instances running.
which, again depending on the configuration, may be a very bad thing,
or completely harmless.


>         $binary $OPTIONS stop >/dev/null 2>&1
>         ret=$?
> 
>         if [ $ret -ne 0 ]; then
>             ocf_log err "Postfix returned an error while stopping." $ret
>             exit $OCF_ERR_GENERIC
>         fi
> 
>         # grant some time for shutdown and recheck
>         sleep 1
>         if postfix_status; then
>             ocf_log err "Postfix failed to stop."
>             exit $OCF_ERR_GENERIC
>         fi

if that should be necessary, you could escalate to "postfix abort", btw.


>     fi
> 
>     exit $OCF_SUCCESS
> }
> 
> postfix_reload()
> {
>     if postfix_status; then
>         ocf_log info "Reloading Postfix."
>         $binary $OPTIONS reload
>     fi

again: why do the "if running"?
crm should not even try to "reload" a "supposedly stopped" resource.
so it is expected to be running, if reload fails because it is actually
NOT running, then that is even better as it probably will trigger a
stop/start instead, which should clean it all up.

> }
> 
> postfix_monitor()
> {
>     if postfix_status; then
>         return $OCF_SUCCESS
>     fi
> 
>     return $OCF_NOT_RUNNING
> }
> 
> postfix_validate_all()
> {
>     # check that the Postfix binary exists and can be executed
>     if [ ! -x "$binary" ]; then
>         ocf_log err "Postfix binary '$binary' does not exist or cannot be 
> executed."
>         return $OCF_ERR_GENERIC
>     fi
> 
>     # check config_dir and alternate_config_directories parameter
>     if [ "x$config_dir" != "x" ]; then
>         if [ ! -d "$config_dir" ]; then
>             ocf_log err "Postfix configuration directory '$config_dir' does 
> not exist." $ret
>             return $OCF_ERR_GENERIC
>         fi
> 
>         alternate_config_directories=$(postconf -h 
> alternate_config_directories 2>/dev/null | grep $config_dir)
>         if [ "x$alternate_config_directories" = "x" ]; then
>             ocf_log err "Postfix main configuration must contain correct 
> 'alternate_config_directories' parameter."
>             return $OCF_ERR_GENERIC
>         fi
>     fi

Dejan already complained about all the GENERIC here, probably better
INSTALLED or CONFIGURED is more appropriate.

>     # check spool/queue directory
>     queue=$(postconf $OPTION_CONFIG_DIR -h queue_directory 2>/dev/null || 
> echo /var/spool/postfix)
>     if [ ! -d "$queue" ]; then
>         ocf_log err "Postfix spool/queue directory '$queue' does not exist." 
> $ret
>         return $OCF_ERR_GENERIC
>     fi
> 
>     # run postfix internal check
>     $binary $OPTIONS check >/dev/null 2>&1
>     ret=$?
>     if [ $ret -ne 0 ]; then
>         ocf_log err "Postfix check failed." $ret
>         return $OCF_ERR_GENERIC
>     fi
> 
>     return $OCF_SUCCESS
> }
> 
> #
> # Main
> #
> 
> if [ $# -ne 1 ]; then
>     usage
>     exit $OCF_ERR_ARGS
> fi
> 
> binary=$OCF_RESKEY_binary
> config_dir=$OCF_RESKEY_config_dir
> parameters=$OCF_RESKEY_parameters
> 
> # debugging stuff
> #echo OCF_RESKEY_binary=$OCF_RESKEY_binary >> 
> /tmp/prox_conf_$OCF_RESOURCE_INSTANCE
> #echo OCF_RESKEY_config_dir=$OCF_RESKEY_config_dir >> 
> /tmp/prox_conf_$OCF_RESOURCE_INSTANCE
> #echo OCF_RESKEY_parameters=$OCF_RESKEY_parameters >> 
> /tmp/prox_conf_$OCF_RESOURCE_INSTANCE
> 
> 
> # build postfix options string *outside* to access from each method
> OPTIONS=''
> OPTION_CONFIG_DIR=''
> 
> # check if the Postfix config_dir exist
> if [ "x$config_dir" != "x" ]; then
>     # save OPTION_CONFIG_DIR seperatly
>     OPTION_CONFIG_DIR="-c $config_dir"
>     OPTIONS=$OPTION_CONFIG_DIR
> fi
> 
> if [ "x$parameters" != "x" ]; then
>     OPTIONS="$OPTIONS $parameters"
> fi
> 
> case $1 in
>     meta-data)  meta_data
>                 exit $OCF_SUCCESS
>                 ;;
> 
>     usage|help) usage
>                 exit $OCF_SUCCESS
>                 ;;
> esac
> 
> postfix_validate_all
> ret=$?
> 
> #echo "[$1:$ret]"
> LSB_STATUS_STOPPED=3
> if [ $ret -ne $OCF_SUCCESS ]; then
>     case $1 in
>     stop)       exit $OCF_SUCCESS ;;
>     monitor)    exit $OCF_NOT_RUNNING;;
>     status)     exit $LSB_STATUS_STOPPED;;
>     *)          exit $ret;;
>     esac
> fi
> 
> case $1 in
>     monitor)    postfix_monitor
>                 exit $?
>                 ;;
>     start)      postfix_start

inconsistent.
why hide the exit in the sub function for some functions, but not others?

>                 ;;
> 
>     stop)       postfix_stop
>                 ;;
> 
>     reload)     postfix_reload
>                 ;;
> 
>     status)     if postfix_status; then
>                     ocf_log info "Postfix is running."
>                     exit $OCF_SUCCESS
>                 else
>                     ocf_log info "Postfix is stopped."
>                     exit $OCF_NOT_RUNNING
>                 fi
>                 ;;
> 
>     monitor)    postfix_monitor
>                 exit $?
>                 ;;

oops. double monitor.

>     validate-all)   exit $OCF_SUCCESS
>                     ;;
> 
>     *)          usage
>                 exit $OCF_ERR_UNIMPLEMENTED
>                 ;;
> esac

Still with me?
Thanks for the effort!

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Re: [Linux-ha-dev] postfix ocf ra

Reply via email to