On 2008-10-21T10:56:41, Xinwei Hu <[EMAIL PROTECTED]> wrote:
Hi Xinwei,
thanks for posting this RA. Looks pretty good, a few comments:
- I assume this "only" does NFS fail-over of the whole server, not of
single exports, and does not support exporting the same fs from several
nodes concurrently?
(If it can't, it should probably return ERR_CONFIGURED when run as a
clone.)
- How is it stacked with IPaddr? It seems to require the IPaddr to be
online beforehand - but if IPaddr is up without the server running yet,
there's a window during which clients would get connection-refused errors.
> #!/bin/sh
Really sh, or bash? (bash is fine, but if so, it should state that.) I
am hazy on what is bash-specific and what isn't, so I thought it safer
to ask ;-)
> # nfsserver
> #
> # Description: Manages nfs server as OCF resource
> # by [EMAIL PROTECTED]
> # License: GNU General Public License 2 (GPL2)
Why GPLv2 only, and not GPLv2 and later?
> if [ -n "$OCF_DEBUG_LIBRARY" ]; then
> . $OCF_DEBUG_LIBRARY
> else
> . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
> fi
I wonder where that snippet comes from; do we actually have a debug
library anywhere? I've seen it in one or two RAs, but just curious.
> <parameter name="nfs_ip" unique="0" required="1">
> <longdesc lang="en">
> The floating IP address used to access the the nfs service
> </longdesc>
> <shortdesc lang="en">
> IP address.
> </shortdesc>
> <content type="string" default="*" />
If this were called "ip" instead of "nfs_ip", it could reference the
IPaddr parameter in the CIB (possible with pacemaker-1.0) and would only
need to be specified once.
(We probably want to eventually have hints like that so the GUI can
support this automatically ...)
> <actions>
> <action name="start" timeout="40" />
> <action name="stop" timeout="10" />
> <action name="status" timeout="90" />
status doesn't need to be specified.
> <action name="monitor" depth="0" timeout="20" interval="10"
> start-delay="1m"/>
Why a start-delay?
> nfsserver_monitor ()
> {
> ${OCF_RESKEY_nfs_init_script} status > /dev/null 2>&1
Should the output be logged?
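If so, maybe capture it and only log on failure, roughly (untested):

    # keep the probe quiet, but log the init script's output when it fails
    output=`${OCF_RESKEY_nfs_init_script} status 2>&1`
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log info "$output"
    fi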
> rc=$?
>
> #Adapte LSB status code to OCF return code
> if [ $rc -eq 0 ]; then
> return $OCF_SUCCESS
> elif [ $rc -eq 3 ]; then
> return $OCF_NOT_RUNNING
> else
> return $OCF_ERR_GENERIC
> fi
> }
Should this handle "unused" as a possible return code translated to
ERR_INSTALLED or ERR_CONFIGURED?
> prepare_directory ()
> {
> fp="$OCF_RESKEY_nfs_shared_infodir"
fp is assigned in many places; this should be checked & set as part of
the general init sequence of the script (see below).
>
> [ -d "$fp" ] || mkdir -p $fp
> [ -d "$fp/rpc_pipefs" ] || mkdir -p $fp/rpc_pipefs
> [ -d "$fp/sm" ] || mkdir -p $fp/sm
> [ -d "$fp/sm.ha" ] || mkdir -p $fp/sm.ha
> [ -d "$fp/sm.bak" ] || mkdir -p $fp/sm.bak
> [ -d "$fp/v4recovery" ] || mkdir -p $fp/v4recovery
> }
>
> is_bound ()
> {
> mount | grep -q "$1 on $2 type none (.*bind)"
> return $?
> }
I have no hard failure case, but this strikes me as fragile, e.g. if
"bind" is not the last option in the mount output, etc.
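One possible alternative that does not depend on the option ordering
would be to compare device:inode of the two directories (GNU stat
assumed; untested):

    is_bound ()
    {
        # with the bind mount in place, $2 resolves to the same dev:inode as $1
        [ "`stat -c '%d:%i' $1`" = "`stat -c '%d:%i' $2`" ]
    }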
> prepare_tree ()
> {
> fp="$OCF_RESKEY_nfs_shared_infodir"
> if is_bound $fp /var/lib/nfs; then
> ocf_log debug "$fp is already bound to /var/lib/nfs"
> return 0
> fi
> mount --bind $fp /var/lib/nfs
> }
Why a bind mount instead of, say, symlinking /var/lib/nfs to the shared
space, or mounting the shared fs there directly?
(No criticism; just trying to understand the reasons.)
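For comparison, the symlink variant would boil down to a one-time setup
like this, assuming /var/lib/nfs can be replaced wholesale:

    # replace /var/lib/nfs with a symlink into the shared storage
    mv /var/lib/nfs /var/lib/nfs.orig
    ln -s "$OCF_RESKEY_nfs_shared_infodir" /var/lib/nfs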
> unprepare_tree ()
> {
A better name might be "bind/unbind" or "take/release", as "unprepare"
isn't really a word.
> fp="$OCF_RESKEY_nfs_shared_infodir"
> if is_bound $fp /var/lib/nfs; then
> umount /var/lib/nfs
> fi
> }
>
> nfsserver_start ()
> {
> prepare_directory
> prepare_tree
>
> cp -f /var/lib/nfs/sm/* /var/lib/nfs/sm.ha 2>&1 > /dev/null
What about left-over files in sm.ha? Should that directory be cleaned
out first?
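Something like

    # drop leftovers from a previous run before copying the current
    # statd monitor list
    rm -f /var/lib/nfs/sm.ha/* > /dev/null 2>&1

right before the cp, if that is safe to do here.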
> ocf_log info "Starting NFS server ..."
> ${OCF_RESKEY_nfs_init_script} start 2>&1 > /dev/null
Should the output here be logged?
> rc=$?
> if [ $rc -ne 0 ]; then
> ocf_log err "Failed to start NFS server"
> return $rc
> fi
>
> #Notify the nfs server has been moved or rebooted
> #The init script do that already, but with the hostname, which may be ignored by client
> #we have to do it again with the nfs_ip
> ${OCF_RESKEY_nfs_notify_cmd} -v $OCF_RESKEY_nfs_ip -P /var/lib/nfs/sm.ha
> rm -f /var/lib/nfs/sm.ha/* 2>&1 > /dev/null
>
> ocf_log info "NFS server started"
> return $OCF_SUCCESS
> }
What happens when this is called twice? Does it pass ocf-tester?
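The usual pattern is to make start idempotent, e.g. at the top of
nfsserver_start (untested):

    # nothing to do if the server is already up and healthy
    if nfsserver_monitor; then
        ocf_log info "NFS server is already running"
        return $OCF_SUCCESS
    fi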
>
> nfsserver_stop ()
> {
> ocf_log info "Stopping NFS server ..."
> ${OCF_RESKEY_nfs_init_script} stop 2>&1 > /dev/null
Should this output be logged?
> rc=$?
> if [ $rc -eq 0 ]; then
> unprepare_tree
> ocf_log info "NFS server stopped"
> return $OCF_SUCCESS
> fi
> ocf_log err "Failed to stop NFS server"
> return $rc
> }
>
> nfsserver_validate ()
> {
> if [ -z ${OCF_RESKEY_nfs_ip} ]; then
> return $OCF_ERR_ARGS
> fi
>
> if [ -x ${OCF_RESKEY_nfs_init_script} && -x ${OCF_RESKEY_nfs_notify_cmd} ]; then
> if [ -d $OCF_RESKEY_nfs_shared_infodir ]; then
> return $OCF_SUCCESS
> else
> return $OCF_ERR_ARGS
> else
> return $OCF_ERR_INSTALLED
> fi
> }
I think you always need to call nfsserver_validate() in your generic
init sequence for this script, before attempting to start/stop/monitor.
Otherwise, the initial probe will fail and throw the cluster into a
reboot cycle if the binary is incorrectly specified.
For checking for the binaries, "check_binary" is likely your friend.
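Roughly, before the action dispatch (just a sketch; adjust the action
names to whatever the script actually accepts):

    # general init: set fp once, make sure the tools exist, and validate
    # before start/stop/monitor - but not for meta-data/usage requests
    case $__OCF_ACTION in
        meta-data|usage|help) ;;
        *)  fp="$OCF_RESKEY_nfs_shared_infodir"
            check_binary ${OCF_RESKEY_nfs_init_script}
            check_binary ${OCF_RESKEY_nfs_notify_cmd}
            nfsserver_validate || exit $?
            ;;
    esac

That would also take care of the fp duplication mentioned above.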
> case $__OCF_ACTION in
> start) nfsserver_start
> ;;
> stop) nfsserver_stop
> ;;
> monitor) nfsserver_monitor
> ;;
> validate-all) nfsserver_validate
> ;;
> *) nfsserver_usage
> exit $OCF_ERR_UNIMPLEMENTED
> ;;
> esac
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/