I'm experimenting with CAP_KILL, CAP_SYS_BOOT and PR_SET_KEEPCAPS.
I will get back to you on this patch tomorrow.

Cheers.

> -----Original Message-----
> From: Mathivanan Naickan Palanivelu
> Sent: Thursday, June 13, 2013 8:53 PM
> To: Anders Widell; Hans Feldt
> Cc: [email protected]
> Subject: RE: [devel] [PATCH 1 of 1] osaf: Add time supervision of
> opensaf_reboot [#437]
> 
> I was also looking at CAP_SYS_ADMIN (as an alternative to
> opensaf_reboot_prepare()) until I ran into
> https://lwn.net/Articles/486306/
> Granting CAP_SYS_ADMIN would make us vulnerable.
> 
> Cheers,
> Mathi.
> 
> 
> > -----Original Message-----
> > From: Anders Widell [mailto:[email protected]]
> > Sent: Tuesday, June 11, 2013 1:56 PM
> > To: Hans Feldt
> > Cc: [email protected]
> > Subject: Re: [devel] [PATCH 1 of 1] osaf: Add time supervision of
> > opensaf_reboot [#437]
> >
> > Maybe I should also point out that the case of getting file
> > descriptors 0, 1 or 2 is not just a hypothetical scenario that I have
> > dreamed up - it actually happens. The code will not work without the retry.
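
A minimal sketch of the retry idea described above, written as a loop rather than as recursion. The helper name open_above_stdio() is an assumption made for illustration and does not appear in the patch; the patch itself opens /proc/sysrq-trigger and retries recursively. The point it shows is that each low descriptor must be held open until a descriptor above 2 has been obtained, otherwise the next open() would simply return the same low number again.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of the patch): open `path` write-only and
 * never return file descriptor 0, 1 or 2. Returns -1 if open() fails. */
static int open_above_stdio(const char *path)
{
    int low[3];     /* at most descriptors 0, 1 and 2 can end up here */
    int n_low = 0;
    int fd;
    int saved_errno;

    for (;;) {
        do {
            fd = open(path, O_WRONLY);
        } while (fd == -1 && errno == EINTR);
        if (fd < 0 || fd > 2)
            break;
        /* Keep the low descriptor open so the next open() gets a higher one. */
        low[n_low++] = fd;
    }

    /* Release the low descriptors we were only holding as placeholders. */
    saved_errno = errno;
    while (n_low > 0)
        close(low[--n_low]);
    errno = saved_errno;

    return fd;
}
```

Calling open_above_stdio("/dev/null") in a process whose stdin has been closed should still give a descriptor of 3 or higher, which is the situation the retry is meant to handle.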
> >
> > I will try to make the code comments more clear, maybe mention the
> > daemonize() function instead of just referring to "dropping root 
> > privileges".
> >
> > regards,
> > Anders Widell
> >
> > On 2013-06-10 17:04, Anders Widell wrote:
> > > See comments below.
> > >
> > > regards,
> > > Anders Widell
> > >
> > > On 2013-06-10 15:47, Hans Feldt wrote:
> > >> Why is not opensaf_reboot_prepare() called from all contexts?
> > > What do you mean by all contexts? It is called by amfwd since it
> > > needs to reboot the local node without running as root. As I said in
> > > the review mail (which maybe should also go into the commit message),
> > > amfd and fmd can simply _Exit() to reboot the local node. This can
> > > be a separate enhancement ticket, since it already works now
> > > (opensaf_reboot() will exit when the timer has expired).
> > >> I think the implementation of opensaf_reboot_prepare() requires
> > >> some comments since it does recursion. I think I understand it but
> > >> it is just a little too clever to be left uncommented...
> > > I did put a comment just at the point of recursive call, but maybe
> > > it wasn't clear enough? :-) Basically, I don't want to get file
> > > descriptors 0, 1 or 2. So if I do get one of those I try again.
> > >> Why is opensaf_reboot_prepare() called before daemonize()? I guess
> > >> that should be commented since it is probably important.
> > > The comment for opensaf_reboot_prepare() says that it must be called
> > > before dropping root privileges. daemonize() is the function that
> > > drops root privileges, so I think it is fairly clear why it is
> > > called before daemonize().
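
A minimal sketch of the ordering argument above: open the file while still privileged, drop privileges afterwards, and keep using the descriptor. The helper name open_then_drop() and its parameters are assumptions for illustration; in the patch the two steps are opensaf_reboot_prepare() and daemonize(). The sketch relies on the general rule that file permissions are checked when the file is opened, not on each later write().

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of the patch): open `path` while still
 * privileged, then switch to the given unprivileged ids.
 * Returns the open descriptor, or -1 on error. */
static int open_then_drop(const char *path, uid_t uid, gid_t gid)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    /* Change the group first: once setuid() has dropped root we may no
     * longer be allowed to change it. A complete privilege drop would also
     * clear supplementary groups with setgroups(); that step is omitted
     * here to keep the sketch short. */
    if (setgid(gid) == -1 || setuid(uid) == -1) {
        close(fd);
        return -1;
    }

    /* The descriptor opened before the drop remains usable. */
    return fd;
}
```

If the two steps were swapped, the open() would run as the unprivileged user and fail for a root-only file such as /proc/sysrq-trigger.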
> > >> Thanks,
> > >> Hans
> > >>
> > >>
> > >> On 06/10/2013 12:50 PM, Anders Widell wrote:
> > >>> 00-README.conf                                   |   5 +
> > >>>    osaf/libs/core/include/ncssysf_def.h             |  20 ++++-
> > >>>    osaf/libs/core/leap/sysf_def.c                   |  93 ++++++++++++++++++++++-
> > >>>    osaf/services/infrastructure/nid/config/nid.conf |   6 +
> > >>>    osaf/services/saf/avsv/amfwdog/amf_wdog.c        |   1 +
> > >>>    scripts/opensaf_reboot                           |  10 ++-
> > >>>    6 files changed, 126 insertions(+), 9 deletions(-)
> > >>>
> > >>>
> > >>> Add a time supervision of the library function opensaf_reboot() as
> > >>> well as the shell script opensaf_reboot. If the reboot has not
> > >>> happened before the timeout, the OS is rebooted hard using the
> > >>> SysRq trigger /proc/sysrq-trigger. This makes it possible to reboot
> > >>> the node also when the system is in a very bad state, for example
> > >>> when fork() fails because the system is out of resources (no free
> > >>> memory, process table full etc.).  It also handles the case when the
> > >>> ordinary reboot command hangs trying to sync the file system, for
> > >>> example due to a disk or NFS problem.
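
A minimal sketch of the time supervision pattern described in the commit message: arm an alarm before the ordinary action and let the SIGALRM handler act as the fallback. The names on_alarm(), fallback_fired and wait_for_fallback() are assumptions for illustration. Unlike the patch, the handler here only sets a flag instead of writing "b" to /proc/sysrq-trigger, and the sketch uses sigaction() and sigsuspend() where the patch uses signal() and pause().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

/* Set by the handler. The real fallback would trigger the hard reboot. */
static volatile sig_atomic_t fallback_fired = 0;

static void on_alarm(int sig_no)
{
    (void)sig_no;
    fallback_fired = 1;
}

/* Hypothetical helper: arm a timeout and wait until the fallback has run.
 * Returns 0 once the handler has fired, -1 on setup failure or when the
 * timeout is zero (supervision disabled). */
static int wait_for_fallback(unsigned timeout_seconds)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask;

    if (timeout_seconds == 0)
        return -1;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, NULL) == -1)
        return -1;

    /* Block SIGALRM so it cannot arrive between the flag test and the wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    fallback_fired = 0;
    alarm(timeout_seconds);

    /* ... the ordinary (possibly hanging) reboot attempt would run here ... */

    /* Atomically unblock SIGALRM and sleep until a signal is delivered. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGALRM);
    while (!fallback_fired)
        sigsuspend(&wait_mask);

    sigprocmask(SIG_SETMASK, &old, NULL);
    return 0;
}
```

Because the alarm is delivered by the kernel, the fallback needs no fork() and no extra memory allocation once it is armed, which fits the resource-exhaustion cases mentioned above.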
> > >>>
> > >>> diff --git a/00-README.conf b/00-README.conf
> > >>> --- a/00-README.conf
> > >>> +++ b/00-README.conf
> > >>> @@ -52,6 +52,11 @@ group/user.
> > >>>
> > >>>    - Use of MDS subslot ID needs to be enabled, add TIPC_USE_SUBSLOT_ID=YES
> > >>>
> > >>> +- Time supervision of local node reboot should be disabled or changed.  Change
> > >>> +  OPENSAF_REBOOT_TIMEOUT to the desired number of seconds before a reboot is
> > >>> +  escalated to an immediate reboot via the SysRq interface, or zero to disable
> > >>> +  this feature.
> > >>> +
> > >>>
> > >>>    *******************************************************************************
> > >>>    nodeinit.conf
> > >>>
> > >>> diff --git a/osaf/libs/core/include/ncssysf_def.h
> > >>> b/osaf/libs/core/include/ncssysf_def.h
> > >>> --- a/osaf/libs/core/include/ncssysf_def.h
> > >>> +++ b/osaf/libs/core/include/ncssysf_def.h
> > >>> @@ -83,7 +83,25 @@ extern "C" {
> > >>>    #define m_START_CRITICAL m_NCS_OS_START_TASK_LOCK
> > >>>    #define m_END_CRITICAL                 m_NCS_OS_END_TASK_LOCK
> > >>>
> > >>> -extern void opensaf_reboot(unsigned int node_id, char *ee_name, const char *reason);
> > >>> +/**
> > >>> + *  Prepare for a future call to opensaf_reboot() by opening the necessary
> > >>> + *  file (/proc/sysrq-trigger). Call this function before dropping root
> > >>> + *  privileges, if you later intend to call opensaf_reboot() to reboot the local
> > >>> + *  node without having root privileges.
> > >>> + */
> > >>> +void opensaf_reboot_prepare(void);
> > >>> +
> > >>> +/**
> > >>> + *  Reboot a node. Call this function with @a node_id zero to reboot the local
> > >>> + *  node. If you intend to use this function to reboot the local node without
> > >>> + *  having root privileges, you must first call opensaf_reboot_prepare() before
> > >>> + *  dropping root privileges.
> > >>> + *
> > >>> + *  Note that this function uses the configuration option OPENSAF_REBOOT_TIMEOUT
> > >>> + *  in nid.conf. Therefore, this function must only be called from services
> > >>> + *  that are started by NID.
> > >>> + */
> > >>> +void opensaf_reboot(unsigned node_id, const char* ee_name, const char* reason);
> > >>>
> > >>>
> > >>>    /*****************************************************************************
> > >>> ** **
> > >>> diff --git a/osaf/libs/core/leap/sysf_def.c
> > >>> b/osaf/libs/core/leap/sysf_def.c
> > >>> --- a/osaf/libs/core/leap/sysf_def.c
> > >>> +++ b/osaf/libs/core/leap/sysf_def.c
> > >>> @@ -26,7 +26,17 @@
> > >>>
> > >>>    #include <configmake.h>
> > >>>
> > >>> -#include <ncsgl_defs.h>
> > >>> +#include <stdio.h>
> > >>> +#include <errno.h>
> > >>> +#include <stdlib.h>
> > >>> +#include <stdbool.h>
> > >>> +#include <sys/stat.h>
> > >>> +#include <fcntl.h>
> > >>> +#include <unistd.h>
> > >>> +#include <signal.h>
> > >>> +#include <syslog.h>
> > >>> +#include "ncs_main_papi.h"
> > >>> +#include "ncsgl_defs.h"
> > >>>    #include "ncs_osprm.h"
> > >>>
> > >>>    #include "ncs_svd.h"
> > >>> @@ -38,6 +48,7 @@
> > >>>    #include "sysf_exc_scr.h"
> > >>>    #include "usrbuf.h"
> > >>>
> > >>> +static int sysrq_trigger_fd = -1;
> > >>>
> > >>>
> > >>>    /*****************************************************************************
> > >>>
> > >>> @@ -271,20 +282,88 @@ uint32_t leap_env_destroy()
> > >>>        return NCSCC_RC_SUCCESS;
> > >>>    }
> > >>>
> > >>> +void opensaf_reboot_prepare(void) {
> > >>> +    if (sysrq_trigger_fd != -1) return;
> > >>> +    int fd;
> > >>> +    do {
> > >>> +        fd = open("/proc/sysrq-trigger", O_WRONLY);
> > >>> +    } while (fd == -1 && errno == EINTR);
> > >>> +    if (fd >= 0 && fd <= 2) {
> > >>> +        /* We don't want to get file descriptors 0, 1 or 2 because:
> > >>> +         *   1) it would be dangerous
> > >>> +         *   2) it would be closed by daemonize()
> > >>> +         */
> > >>> +        opensaf_reboot_prepare();
> > >>> +        close(fd);
> > >>> +    } else {
> > >>> +        sysrq_trigger_fd = fd;
> > >>> +    }
> > >>> +}
> > >>> +
> > >>> +static void opensaf_reboot_fallback(int sig_no) {
> > >>> +    (void) sig_no;
> > >>> +    if (sysrq_trigger_fd == -1) {
> > >>> +        do {
> > >>> +            sysrq_trigger_fd = open("/proc/sysrq-trigger", O_WRONLY);
> > >>> +        } while (sysrq_trigger_fd == -1 && errno == EINTR);
> > >>> +    }
> > >>> +    if (sysrq_trigger_fd != -1) {
> > >>> +        char buf[] = {'b'};
> > >>> +        ssize_t result;
> > >>> +        do {
> > >>> +            result = write(sysrq_trigger_fd, buf, sizeof(buf));
> > >>> +        } while (result == -1 && errno == EINTR);
> > >>> +    }
> > >>> +    _Exit(EXIT_SUCCESS);
> > >>> +}
> > >>> +
> > >>>    /**
> > >>>     *
> > >>>     * @param reason
> > >>>     */
> > >>> -void opensaf_reboot(unsigned int node_id, char *ee_name, const char *reason)
> > >>> +void opensaf_reboot(unsigned node_id, const char* ee_name, const char* reason)
> > >>>    {
> > >>> +    char* env_var = getenv("OPENSAF_REBOOT_TIMEOUT");
> > >>> +    unsigned long supervision_time = 0;
> > >>> +    if (env_var != NULL) {
> > >>> +        char* endptr;
> > >>> +        errno = 0;
> > >>> +        supervision_time = strtoul(env_var, &endptr, 0);
> > >>> +        if (errno != 0 || *env_var == '\0' || *endptr != '\0') {
> > >>> +            supervision_time = 0;
> > >>> +        }
> > >>> +    }
> > >>> +
> > >>> +    unsigned own_node_id = ncs_get_node_id();
> > >>> +    bool use_fallback = supervision_time > 0 && (node_id == 0 || node_id ==
> > >>> +        own_node_id);
> > >>> +    if (use_fallback) {
> > >>> +        if (signal(SIGALRM, opensaf_reboot_fallback) == SIG_ERR) {
> > >>> +            opensaf_reboot_fallback(0);
> > >>> +        }
> > >>> +        alarm(supervision_time);
> > >>> +    }
> > >>> +
> > >>> +    syslog(LOG_CRIT,
> > >>> +        "Rebooting OpenSAF NodeId = %u EE Name = %s, Reason: %s, "
> > >>> +        "OwnNodeId = %u, SupervisionTime = %lu",
> > >>> +        node_id, ee_name == NULL ? "No EE Mapped" : ee_name, reason,
> > >>> +        own_node_id, supervision_time);
> > >>>
> > >>>        char str[256];
> > >>> -    memset(str,0,256);
> > >>> +    snprintf(str, sizeof(str), PKGLIBDIR "/opensaf_reboot %u %s", node_id,
> > >>> +        ee_name == NULL ? "" : ee_name);
> > >>> +    int reboot_result = system(str);
> > >>> +    if (reboot_result != EXIT_SUCCESS) {
> > >>> +            syslog(LOG_CRIT, "node reboot failure: exit code %d",
> > >>> +            reboot_result);
> > >>> +    }
> > >>>
> > >>> -    snprintf(str,255,PKGLIBDIR"/opensaf_reboot %d %s\n",node_id,((ee_name == NULL)?"":ee_name));
> > >>> -    syslog(LOG_CRIT,"Rebooting OpenSAF NodeId = %d EE Name = %s, Reason: %s\n",node_id,((ee_name == NULL)? "No EE Mapped":ee_name),reason);
> > >>> -    if(system(str) == -1){
> > >>> -            syslog(LOG_CRIT, "node reboot failure!");
> > >>> +    if (use_fallback) {
> > >>> +        /* Wait for the alarm signal we set up earlier. */
> > >>> +        for (;;) pause();
> > >>>        }
> > >>>    }
> > >>>
> > >>> diff --git a/osaf/services/infrastructure/nid/config/nid.conf
> > >>> b/osaf/services/infrastructure/nid/config/nid.conf
> > >>> --- a/osaf/services/infrastructure/nid/config/nid.conf
> > >>> +++ b/osaf/services/infrastructure/nid/config/nid.conf
> > >>> @@ -23,6 +23,12 @@ OPENSAF_MANAGE_TIPC="yes"
> > >>>    # Specifies how long "opensafd stop" should wait before stop has considered to fail
> > >>>    OPENSAF_TERMTIMEOUT=60
> > >>>
> > >>> +# Number of seconds before a reboot is escalated to an immediate reboot via the
> > >>> +# SysRq interface /proc/sysrq-trigger.  Comment it out or set it to zero to
> > >>> +# disable this feature.  Note that you must make sure the kernel allows reboot
> > >>> +# via SysRq for this feature to work.
> > >>> +export OPENSAF_REBOOT_TIMEOUT=60
> > >>> +
> > >>>    # Specify the UNIX group and user OpenSAF run as
> > >>>    export OPENSAF_GROUP=opensaf
> > >>>    export OPENSAF_USER=opensaf
> > >>> diff --git a/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> > >>> b/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> > >>> --- a/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> > >>> +++ b/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> > >>> @@ -137,6 +137,7 @@ int main(int argc, char *argv[])
> > >>>        SaAmfHealthcheckKeyT hc_key;
> > >>>        char *hc_key_env;
> > >>>
> > >>> +    opensaf_reboot_prepare();
> > >>>        daemonize(argc, argv);
> > >>>
> > >>>        ava_install_amf_down_cb(amf_down_cb);
> > >>> diff --git a/scripts/opensaf_reboot b/scripts/opensaf_reboot
> > >>> --- a/scripts/opensaf_reboot
> > >>> +++ b/scripts/opensaf_reboot
> > >>> @@ -67,7 +67,15 @@ if [ "$self_node_id" = "$node_id" ] || [
> > >>>        # uncomment the following line if debugging errors that keep restarting the node
> > >>>        # exit 0
> > >>>
> > >>> -    logger -t "opensaf_reboot" "Rebooting local node"
> > >>> +    logger -t "opensaf_reboot" "Rebooting local node; timeout=$OPENSAF_REBOOT_TIMEOUT"
> > >>> +
> > >>> +    # Start a reboot supervision background process. Note that a similar
> > >>> +    # supervision is also done in the opensaf_reboot() function in LEAP.
> > >>> +    # However, that supervision may be stopped by one of the pkill commands
> > >>> +    # below, if it was called from AMF or FM.
> > >>> +    if [ "${OPENSAF_REBOOT_TIMEOUT}0" -gt "0" ]; then
> > >>> +        (sleep "$OPENSAF_REBOOT_TIMEOUT"; echo -n "b" > "/proc/sysrq-trigger") &
> > >>> +    fi
> > >>>
> > >>>        # Stop some important opensaf processes to prevent bad things from happening
> > >>>        $icmd pkill -STOP osafamfwd
> > >>>
> > >>> ------------------------------------------------------------------
> > >>> --
> > >>> ----------
> > >>>
> > >>> How ServiceNow helps IT people transform IT departments:
> > >>> 1. A cloud service to automate IT design, transition and
> > >>> operations 2. Dashboards that offer high-level views of enterprise
> services 3.
> > >>> A single system of record for all IT processes
> > >>> http://p.sf.net/sfu/servicenow-d2d-j
> > >>> _______________________________________________
> > >>> Opensaf-devel mailing list
> > >>> [email protected]
> > >>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
> > >>>
> > >>>
> > >>
> > >
> > >
> > >
> >
> >
> > ----------------------------------------------------------------------
> > -------- This SF.net email is sponsored by Windows:
> >
> > Build for Windows Store.
> >
> > http://p.sf.net/sfu/windows-dev2dev
> > _______________________________________________
> > Opensaf-devel mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/opensaf-devel
