I'm playing with CAP_KILL, CAP_SYS_BOOT and PR_SET_KEEPCAPS. I will get back to this patch tomorrow.
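
For reference, here is a minimal sketch of how the keep-capabilities flag of prctl() could be combined with CAP_KILL and CAP_SYS_BOOT. It assumes the goal is to retain just those two capabilities when a daemon switches from root to an unprivileged user. The helper names set_caps() and drop_root_keep_caps() are made up for illustration and are not part of the patch. The sketch uses the raw capset system call instead of libcap, and it leaves out setgroups() and detailed error reporting.

```c
#include <linux/capability.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Capabilities to retain. Both are numbered below 32, so they fit in the
 * first 32-bit word of the capability sets. */
#define KEPT_CAPS ((UINT32_C(1) << CAP_KILL) | (UINT32_C(1) << CAP_SYS_BOOT))

/* Hypothetical helper: set the permitted and effective sets of the calling
 * thread to `mask` and clear the inheritable set. Returns 0 or -1. */
static int set_caps(uint32_t mask)
{
	struct __user_cap_header_struct header = {
		.version = _LINUX_CAPABILITY_VERSION_3,
		.pid = 0 /* 0 means the calling thread */
	};
	struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = { { 0 } };

	data[0].effective = mask;
	data[0].permitted = mask;
	return (int) syscall(SYS_capset, &header, data);
}

/* Hypothetical helper: switch to uid/gid but keep KEPT_CAPS. Must be called
 * while still running as root. Returns 0 or -1. */
static int drop_root_keep_caps(uid_t uid, gid_t gid)
{
	/* Without this flag the permitted set is cleared by setuid(). */
	if (prctl(PR_SET_KEEPCAPS, 1L, 0L, 0L, 0L) == -1)
		return -1;
	if (setgid(gid) == -1 || setuid(uid) == -1)
		return -1;
	/* setuid() still clears the effective set, so raise it again and
	 * drop everything we do not need from the permitted set. */
	return set_caps(KEPT_CAPS);
}
```

With CAP_SYS_BOOT retained, the process would be allowed to call reboot(2) after dropping root. Whether that is preferable to the pre-opened /proc/sysrq-trigger descriptor in the patch is something I have not concluded yet.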
Cheers.

> -----Original Message-----
> From: Mathivanan Naickan Palanivelu
> Sent: Thursday, June 13, 2013 8:53 PM
> To: Anders Widell; Hans Feldt
> Cc: [email protected]
> Subject: RE: [devel] [PATCH 1 of 1] osaf: Add time supervision of
> opensaf_reboot [#437]
>
> I was also looking at CAP_SYS_ADMIN(alternative for
> opensaf_reboot_prepare()), as an option until I ran into
> https://lwn.net/Articles/486306/!
> CAP_SYS_ADMIN would make us vulnerable.
>
> Cheers,
> Mathi.
>
> > -----Original Message-----
> > From: Anders Widell [mailto:[email protected]]
> > Sent: Tuesday, June 11, 2013 1:56 PM
> > To: Hans Feldt
> > Cc: [email protected]
> > Subject: Re: [devel] [PATCH 1 of 1] osaf: Add time supervision of
> > opensaf_reboot [#437]
> >
> > Maybe I should also point out that the case of getting file
> > descriptors 0, 1 or 2 is not just a hypothetical scenario that I have
> > dreamed up - it actually happens. The code will not work without the retry.
> >
> > I will try to make the code comments more clear, maybe mention the
> > daemonize() function instead of just referring to "dropping root
> > privileges".
> >
> > regards,
> > Anders Widell
> >
> > On 2013-06-10 17:04, Anders Widell wrote:
> > > See comments below.
> > >
> > > regards,
> > > Anders Widell
> > >
> > > On 2013-06-10 15:47, Hans Feldt wrote:
> > >> Why is not opensaf_reboot_prepare() called from all contexts?
> > > What do you mean by all contexts? It is called by amfwd since it
> > > needs to reboot the local node without running as root. As I said in
> > > the review mail (but maybe also should go into the commit message),
> > > amfd and fmd can simply _Exit() to reboot the local node. This can
> > > be a separate enhancement ticket, since it works already now
> > > (opensaf_reboot() will exit when the timer has expired).
> > >> I think the implementation of opensaf_reboot_prepare() requires
> > >> some comments since it does recursion. I think I understand it but
> > >> it is just a little too clever to be uncommented...
> > > I did put a comment just at the point of recursive call, but maybe
> > > it wasn't clear enough? :-) Basically, I don't want to get file
> > > descriptors 0, 1 or 2. So if I do get one of those I try again.
> > >> Why is opensaf_reboot_prepare() called before daemonize()? I guess
> > >> that should be commented since it is probably important.
> > > The comment for opensaf_reboot_prepare() says that it must be called
> > > before dropping root privileges. daemonize() is the function that
> > > drops root privileges, so I think it is fairly clear why it is
> > > called before daemonize().
> > >> Thanks,
> > >> Hans
> > >>
> > >>
> > >> On 06/10/2013 12:50 PM, Anders Widell wrote:
> > >>>  00-README.conf                                   |   5 +
> > >>>  osaf/libs/core/include/ncssysf_def.h             |  20 ++++-
> > >>>  osaf/libs/core/leap/sysf_def.c                   |  93 ++++++++++++++++++++++-
> > >>>  osaf/services/infrastructure/nid/config/nid.conf |   6 +
> > >>>  osaf/services/saf/avsv/amfwdog/amf_wdog.c        |   1 +
> > >>>  scripts/opensaf_reboot                           |  10 ++-
> > >>>  6 files changed, 126 insertions(+), 9 deletions(-)
> > >>>
> > >>>
> > >>> Add a time supervision of the library function opensaf_reboot() as
> > >>> well as the shell script opensaf_reboot. If the reboot has not
> > >>> happened before the timeout, the OS is rebooted hard using the
> > >>> SysRq trigger /proc/sysrq-trigger. This makes it possible to
> > >>> reboot the node also when the system is in a very bad state, for
> > >>> example when fork() fails because the system is out of resources
> > >>> (no free memory, process table full etc.). It also handles the case
> > >>> when the ordinary reboot command hangs trying to sync the file
> > >>> system, for example due to a disk or NFS problem.
> > >>>
> > >>> diff --git a/00-README.conf b/00-README.conf
> > >>> --- a/00-README.conf
> > >>> +++ b/00-README.conf
> > >>> @@ -52,6 +52,11 @@ group/user.
> > >>>
> > >>>  - Use of MDS subslot ID needs to be enabled, add TIPC_USE_SUBSLOT_ID=YES
> > >>>
> > >>> +- Time supervision of local node reboot should be disabled or changed. Change
> > >>> +  OPENSAF_REBOOT_TIMEOUT to the desired number of seconds before a reboot is
> > >>> +  escalated to an immediate reboot via the SysRq interface, or zero to disable
> > >>> +  this feature.
> > >>> +
> > >>>  *******************************************************************************
> > >>>  nodeinit.conf
> > >>>
> > >>> diff --git a/osaf/libs/core/include/ncssysf_def.h b/osaf/libs/core/include/ncssysf_def.h
> > >>> --- a/osaf/libs/core/include/ncssysf_def.h
> > >>> +++ b/osaf/libs/core/include/ncssysf_def.h
> > >>> @@ -83,7 +83,25 @@ extern "C" {
> > >>>  #define m_START_CRITICAL m_NCS_OS_START_TASK_LOCK
> > >>>  #define m_END_CRITICAL m_NCS_OS_END_TASK_LOCK
> > >>>
> > >>> -extern void opensaf_reboot(unsigned int node_id, char *ee_name, const char *reason);
> > >>> +/**
> > >>> + * Prepare for a future call to opensaf_reboot() by opening the necessary
> > >>> + * file (/proc/sysrq-trigger). Call this function before dropping root
> > >>> + * privileges, if you later intend to call opensaf_reboot() to reboot the local
> > >>> + * node without having root privileges.
> > >>> + */
> > >>> +void opensaf_reboot_prepare(void);
> > >>> +
> > >>> +/**
> > >>> + * Reboot a node. Call this function with @a node_id zero to reboot the local
> > >>> + * node. If you intend to use this function to reboot the local node without
> > >>> + * having root privileges, you must first call opensaf_reboot_prepare() before
> > >>> + * dropping root privileges.
> > >>> + *
> > >>> + * Note that this function uses the configuration option OPENSAF_REBOOT_TIMEOUT
> > >>> + * in nid.conf. Therefore, this function must only be called from services
> > >>> + * that are started by NID.
> > >>> + */
> > >>> +void opensaf_reboot(unsigned node_id, const char* ee_name, const char* reason);
> > >>>
> > >>>
> > >>>  /*****************************************************************************
> > >>>  **                                                                          **
> > >>> diff --git a/osaf/libs/core/leap/sysf_def.c b/osaf/libs/core/leap/sysf_def.c
> > >>> --- a/osaf/libs/core/leap/sysf_def.c
> > >>> +++ b/osaf/libs/core/leap/sysf_def.c
> > >>> @@ -26,7 +26,17 @@
> > >>>
> > >>>  #include <configmake.h>
> > >>>
> > >>> -#include <ncsgl_defs.h>
> > >>> +#include <stdio.h>
> > >>> +#include <errno.h>
> > >>> +#include <stdlib.h>
> > >>> +#include <stdbool.h>
> > >>> +#include <sys/stat.h>
> > >>> +#include <fcntl.h>
> > >>> +#include <unistd.h>
> > >>> +#include <signal.h>
> > >>> +#include <syslog.h>
> > >>> +#include "ncs_main_papi.h"
> > >>> +#include "ncsgl_defs.h"
> > >>>  #include "ncs_osprm.h"
> > >>>
> > >>>  #include "ncs_svd.h"
> > >>> @@ -38,6 +48,7 @@
> > >>>  #include "sysf_exc_scr.h"
> > >>>  #include "usrbuf.h"
> > >>>
> > >>> +static int sysrq_trigger_fd = -1;
> > >>>
> > >>>  /*****************************************************************************
> > >>>
> > >>> @@ -271,20 +282,88 @@ uint32_t leap_env_destroy()
> > >>>  	return NCSCC_RC_SUCCESS;
> > >>>  }
> > >>>
> > >>> +void opensaf_reboot_prepare(void) {
> > >>> +	if (sysrq_trigger_fd != -1) return;
> > >>> +	int fd;
> > >>> +	do {
> > >>> +		fd = open("/proc/sysrq-trigger", O_WRONLY);
> > >>> +	} while (fd == -1 && errno == EINTR);
> > >>> +	if (fd >= 0 && fd <= 2) {
> > >>> +		/* We don't want to get file descriptors 0, 1 or 2 because:
> > >>> +		 * 1) it would be dangerous
> > >>> +		 * 2) it would by closed by deamonize()
> > >>> +		 */
> > >>> +		opensaf_reboot_prepare();
> > >>> +		close(fd);
> > >>> +	} else {
> > >>> +		sysrq_trigger_fd = fd;
> > >>> +	}
> > >>> +}
> > >>> +
> > >>> +static void opensaf_reboot_fallback(int sig_no) {
> > >>> +	(void) sig_no;
> > >>> +	if (sysrq_trigger_fd == -1) {
> > >>> +		do {
> > >>> +			sysrq_trigger_fd = open("/proc/sysrq-trigger", O_WRONLY);
> > >>> +		} while (sysrq_trigger_fd == -1 && errno == EINTR);
> > >>> +	}
> > >>> +	if (sysrq_trigger_fd != -1) {
> > >>> +		char buf[] = {'b'};
> > >>> +		ssize_t result;
> > >>> +		do {
> > >>> +			result = write(sysrq_trigger_fd, buf, sizeof(buf));
> > >>> +		} while (result == -1 && errno == EINTR);
> > >>> +	}
> > >>> +	_Exit(EXIT_SUCCESS);
> > >>> +}
> > >>> +
> > >>>  /**
> > >>>   *
> > >>>   * @param reason
> > >>>   */
> > >>> -void opensaf_reboot(unsigned int node_id, char *ee_name, const char *reason)
> > >>> +void opensaf_reboot(unsigned node_id, const char* ee_name, const char* reason)
> > >>>  {
> > >>> +	char* env_var = getenv("OPENSAF_REBOOT_TIMEOUT");
> > >>> +	unsigned long supervision_time = 0;
> > >>> +	if (env_var != NULL) {
> > >>> +		char* endptr;
> > >>> +		errno = 0;
> > >>> +		supervision_time = strtoul(env_var, &endptr, 0);
> > >>> +		if (errno != 0 || *env_var == '\0' || *endptr != '\0') {
> > >>> +			supervision_time = 0;
> > >>> +		}
> > >>> +	}
> > >>> +
> > >>> +	unsigned own_node_id = ncs_get_node_id();
> > >>> +	bool use_fallback = supervision_time > 0 && (node_id == 0 || node_id ==
> > >>> +		own_node_id);
> > >>> +	if (use_fallback) {
> > >>> +		if (signal(SIGALRM, opensaf_reboot_fallback) == SIG_ERR) {
> > >>> +			opensaf_reboot_fallback(0);
> > >>> +		}
> > >>> +		alarm(supervision_time);
> > >>> +	}
> > >>> +
> > >>> +	syslog(LOG_CRIT,
> > >>> +		"Rebooting OpenSAF NodeId = %u EE Name = %s, Reason: %s, "
> > >>> +		"OwnNodeId = %u, SupervisionTime = %lu",
> > >>> +		node_id, ee_name == NULL ? "No EE Mapped" : ee_name, reason,
> > >>> +		own_node_id, supervision_time);
> > >>>
> > >>>  	char str[256];
> > >>> -	memset(str,0,256);
> > >>> +	snprintf(str, sizeof(str), PKGLIBDIR "/opensaf_reboot %u %s", node_id,
> > >>> +		ee_name == NULL ? "" : ee_name);
> > >>> +	int reboot_result = system(str);
> > >>> +	if (reboot_result != EXIT_SUCCESS) {
> > >>> +		syslog(LOG_CRIT, "node reboot failure: exit code %d",
> > >>> +			reboot_result);
> > >>> +	}
> > >>>
> > >>> -	snprintf(str,255,PKGLIBDIR"/opensaf_reboot %d %s\n",node_id,((ee_name == NULL)?"":ee_name));
> > >>> -	syslog(LOG_CRIT,"Rebooting OpenSAF NodeId = %d EE Name = %s, Reason: %s\n",node_id,((ee_name == NULL)? "No EE Mapped":ee_name),reason);
> > >>> -	if(system(str) == -1){
> > >>> -		syslog(LOG_CRIT, "node reboot failure!");
> > >>> +	if (use_fallback) {
> > >>> +		/* Wait for the alarm signal we set up earlier. */
> > >>> +		for (;;) pause();
> > >>>  	}
> > >>>  }
> > >>>
> > >>> diff --git a/osaf/services/infrastructure/nid/config/nid.conf b/osaf/services/infrastructure/nid/config/nid.conf
> > >>> --- a/osaf/services/infrastructure/nid/config/nid.conf
> > >>> +++ b/osaf/services/infrastructure/nid/config/nid.conf
> > >>> @@ -23,6 +23,12 @@ OPENSAF_MANAGE_TIPC="yes"
> > >>>  # Specifies how long "opensafd stop" should wait before stop has considered to fail
> > >>>  OPENSAF_TERMTIMEOUT=60
> > >>>
> > >>> +# Number of seconds before a reboot is escalated to an immediate reboot via the
> > >>> +# SysRq interface /proc/sysrq-trigger. Comment it out or set it to zero to
> > >>> +# disable this feature. Note that you must make sure the kernel allows reboot
> > >>> +# via SysRq for this feature to work.
> > >>> +export OPENSAF_REBOOT_TIMEOUT=60
> > >>> +
> > >>>  # Specify the UNIX group and user OpenSAF run as
> > >>>  export OPENSAF_GROUP=opensaf
> > >>>  export OPENSAF_USER=opensaf
> > >>> diff --git a/osaf/services/saf/avsv/amfwdog/amf_wdog.c b/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> > >>> --- a/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> > >>> +++ b/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> > >>> @@ -137,6 +137,7 @@ int main(int argc, char *argv[])
> > >>>  	SaAmfHealthcheckKeyT hc_key;
> > >>>  	char *hc_key_env;
> > >>>
> > >>> +	opensaf_reboot_prepare();
> > >>>  	daemonize(argc, argv);
> > >>>
> > >>>  	ava_install_amf_down_cb(amf_down_cb);
> > >>> diff --git a/scripts/opensaf_reboot b/scripts/opensaf_reboot
> > >>> --- a/scripts/opensaf_reboot
> > >>> +++ b/scripts/opensaf_reboot
> > >>> @@ -67,7 +67,15 @@ if [ "$self_node_id" = "$node_id" ] || [
> > >>>  	# uncomment the following line if debugging errors that keep restarting the node
> > >>>  	# exit 0
> > >>>
> > >>> -	logger -t "opensaf_reboot" "Rebooting local node"
> > >>> +	logger -t "opensaf_reboot" "Rebooting local node; timeout=$OPENSAF_REBOOT_TIMEOUT"
> > >>> +
> > >>> +	# Start a reboot supervision background process. Note that a similar
> > >>> +	# supervision is also done in the opensaf_reboot() function in LEAP.
> > >>> +	# However, that supervision may be stopped by one of the pkill commands
> > >>> +	# below, if it was called from AMF or FM.
> > >>> +	if [ "${OPENSAF_REBOOT_TIMEOUT}0" -gt "0" ]; then
> > >>> +		(sleep "$OPENSAF_REBOOT_TIMEOUT"; echo -n "b" > "/proc/sysrq-trigger") &
> > >>> +	fi
> > >>>
> > >>>  	# Stop some important opensaf processes to prevent bad things from happening
> > >>>  	$icmd pkill -STOP osafamfwd

_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
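
Regarding the recursion in opensaf_reboot_prepare() discussed in the thread above: one possible alternative, not what the patch does, is to let fcntl() with F_DUPFD move a low-numbered descriptor to 3 or higher. The sketch below shows that idea. The helper name open_above_stdio() is hypothetical and chosen only for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of the patch): open `path` write-only and
 * return a file descriptor numbered 3 or higher, or -1 on failure. */
static int open_above_stdio(const char *path)
{
	int fd;

	/* Retry if the call is interrupted by a signal. */
	do {
		fd = open(path, O_WRONLY);
	} while (fd == -1 && errno == EINTR);

	if (fd == -1 || fd > STDERR_FILENO)
		return fd;

	/* We got 0, 1 or 2. F_DUPFD duplicates onto the lowest free
	 * descriptor that is greater than or equal to its argument. */
	int high_fd = fcntl(fd, F_DUPFD, STDERR_FILENO + 1);
	int saved_errno = errno;

	close(fd); /* release the low-numbered descriptor again */
	errno = saved_errno;
	return high_fd;
}
```

With such a helper, opensaf_reboot_prepare() could in principle assign the result of open_above_stdio("/proc/sysrq-trigger") to sysrq_trigger_fd without calling itself. The observable behaviour should be the same as with the recursive version: the descriptor that is kept is never 0, 1 or 2, and any low-numbered descriptor obtained along the way is closed.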
