The problem with using capabilities is that we would need to use either libcap or libcap-ng, neither of which is included in LSB.
regards,
Anders Widell

On 2013-06-13 17:39, Mathivanan Naickan Palanivelu wrote:
> I'm playing with CAP_KILL, CAP_SYS_BOOT and PR_SET_KEEPCAP.
> Will get back on this patch tomorrow.
>
> Cheers.
>
>> -----Original Message-----
>> From: Mathivanan Naickan Palanivelu
>> Sent: Thursday, June 13, 2013 8:53 PM
>> To: Anders Widell; Hans Feldt
>> Cc: [email protected]
>> Subject: RE: [devel] [PATCH 1 of 1] osaf: Add time supervision of
>> opensaf_reboot [#437]
>>
>> I was also looking at CAP_SYS_ADMIN(alternative for
>> opensaf_reboot_prepare()), as an option until I ran into
>> https://lwn.net/Articles/486306/!
>> CAP_SYS_ADMIN would make us vulnerable.
>>
>> Cheers,
>> Mathi.
>>
>>
>>> -----Original Message-----
>>> From: Anders Widell [mailto:[email protected]]
>>> Sent: Tuesday, June 11, 2013 1:56 PM
>>> To: Hans Feldt
>>> Cc: [email protected]
>>> Subject: Re: [devel] [PATCH 1 of 1] osaf: Add time supervision of
>>> opensaf_reboot [#437]
>>>
>>> Maybe I should also point out that the case of getting file
>>> descriptors 0, 1 or 2 is not just a hypothetical scenario that I have
>>> dreamed up - it actually happens. The code will not work without the retry.
>>>
>>> I will try to make the code comments more clear, maybe mention the
>>> daemonize() function instead of just referring to "dropping root
>>> privileges".
>>>
>>> regards,
>>> Anders Widell
>>>
>>> On 2013-06-10 17:04, Anders Widell wrote:
>>>> See comments below.
>>>>
>>>> regards,
>>>> Anders Widell
>>>>
>>>> On 2013-06-10 15:47, Hans Feldt wrote:
>>>>> Why is not opensaf_reboot_prepare() called from all contexts?
>>>> What do you mean by all contexts? It is called by amfwd since it
>>>> needs to reboot the local node without running as root. As I said in
>>>> the review mail (but maybe also should go into the commit message),
>>>> amfd and fmd can simply _Exit() to reboot the local node.
>>>> This can be a separate enhancement ticket, since it works already now
>>>> (opensaf_reboot() will exit when the timer has expired).
>>>>> I think the implementation of opensaf_reboot_prepare() requires
>>>>> some comments since it does recursion. I think I understand it but
>>>>> it is just a little to clever to be uncommented...
>>>> I did put a comment just at the point of recursive call, but maybe
>>>> it wasn't clear enough? :-) Basically, I don't want to get file
>>>> descriptors 0, 1 or 2. So if I do get one of those I try again.
>>>>> Why is opensaf_reboot_prepare() called before daemonize()? I guess
>>>>> that should be commented since it is probably important.
>>>> The comment for opensaf_reboot_prepare() says that it must be called
>>>> before dropping root privileges. daemonize() is the function that
>>>> drops root privileges, so I think it is fairly clear why it is
>>>> called before daemonize().
>>>>> Thanks,
>>>>> Hans
>>>>>
>>>>>
>>>>> On 06/10/2013 12:50 PM, Anders Widell wrote:
>>>>>>  00-README.conf                                   |   5 +
>>>>>>  osaf/libs/core/include/ncssysf_def.h             |  20 ++++-
>>>>>>  osaf/libs/core/leap/sysf_def.c                   |  93 ++++++++++++++++++++++-
>>>>>>  osaf/services/infrastructure/nid/config/nid.conf |   6 +
>>>>>>  osaf/services/saf/avsv/amfwdog/amf_wdog.c        |   1 +
>>>>>>  scripts/opensaf_reboot                           |  10 ++-
>>>>>>  6 files changed, 126 insertions(+), 9 deletions(-)
>>>>>>
>>>>>>
>>>>>> Add a time supervision of the library function opensaf_reboot() as
>>>>>> well as the shell script opensaf_reboot. If the reboot has not
>>>>>> happened before the timeout, the OS is rebooted hard using the
>>>>>> SysRq trigger /proc/sysrq-trigger. This makes it possible to reboot
>>>>>> the node also when the system is in a very bad state, for example
>>>>>> when fork() fails because the system is out of resources (no free
>>>>>> memory, process table full etc.).
>>>>>> It also handles the case when the ordinary reboot command hangs
>>>>>> trying to sync the file system, for example due to a disk or NFS
>>>>>> problem.
>>>>>>
>>>>>> diff --git a/00-README.conf b/00-README.conf
>>>>>> --- a/00-README.conf
>>>>>> +++ b/00-README.conf
>>>>>> @@ -52,6 +52,11 @@ group/user.
>>>>>>
>>>>>>  - Use of MDS subslot ID needs to be enabled, add TIPC_USE_SUBSLOT_ID=YES
>>>>>>
>>>>>> +- Time supervision of local node reboot should be disabled or changed. Change
>>>>>> +  OPENSAF_REBOOT_TIMEOUT to the desired number of seconds before a reboot is
>>>>>> +  escalated to an immediate reboot via the SysRq interface, or zero to disable
>>>>>> +  this feature.
>>>>>> +
>>>>>>  *******************************************************************************
>>>>>>  nodeinit.conf
>>>>>>
>>>>>> diff --git a/osaf/libs/core/include/ncssysf_def.h b/osaf/libs/core/include/ncssysf_def.h
>>>>>> --- a/osaf/libs/core/include/ncssysf_def.h
>>>>>> +++ b/osaf/libs/core/include/ncssysf_def.h
>>>>>> @@ -83,7 +83,25 @@ extern "C" {
>>>>>>  #define m_START_CRITICAL m_NCS_OS_START_TASK_LOCK
>>>>>>  #define m_END_CRITICAL m_NCS_OS_END_TASK_LOCK
>>>>>>
>>>>>> -extern void opensaf_reboot(unsigned int node_id, char *ee_name, const char *reason);
>>>>>> +/**
>>>>>> + * Prepare for a future call to opensaf_reboot() by opening the necessary
>>>>>> + * file (/proc/sysrq-trigger). Call this function before dropping root
>>>>>> + * privileges, if you later intend to call opensaf_reboot() to reboot the local
>>>>>> + * node without having root privileges.
>>>>>> + */
>>>>>> +void opensaf_reboot_prepare(void);
>>>>>> +
>>>>>> +/**
>>>>>> + * Reboot a node. Call this function with @a node_id zero to reboot the local
>>>>>> + * node.
>>>>>> + * If you intend to use this function to reboot the local node without
>>>>>> + * having root privileges, you must first call opensaf_reboot_prepare() before
>>>>>> + * dropping root privileges.
>>>>>> + *
>>>>>> + * Note that this function uses the configuration option OPENSAF_REBOOT_TIMEOUT
>>>>>> + * in nid.conf. Therefore, this function must only be called from services
>>>>>> + * that are started by NID.
>>>>>> + */
>>>>>> +void opensaf_reboot(unsigned node_id, const char* ee_name, const char* reason);
>>>>>>
>>>>>>  /*****************************************************************************
>>>>>>  ** **
>>>>>> diff --git a/osaf/libs/core/leap/sysf_def.c b/osaf/libs/core/leap/sysf_def.c
>>>>>> --- a/osaf/libs/core/leap/sysf_def.c
>>>>>> +++ b/osaf/libs/core/leap/sysf_def.c
>>>>>> @@ -26,7 +26,17 @@
>>>>>>
>>>>>>  #include <configmake.h>
>>>>>>
>>>>>> -#include <ncsgl_defs.h>
>>>>>> +#include <stdio.h>
>>>>>> +#include <errno.h>
>>>>>> +#include <stdlib.h>
>>>>>> +#include <stdbool.h>
>>>>>> +#include <sys/stat.h>
>>>>>> +#include <fcntl.h>
>>>>>> +#include <unistd.h>
>>>>>> +#include <signal.h>
>>>>>> +#include <syslog.h>
>>>>>> +#include "ncs_main_papi.h"
>>>>>> +#include "ncsgl_defs.h"
>>>>>>  #include "ncs_osprm.h"
>>>>>>
>>>>>>  #include "ncs_svd.h"
>>>>>> @@ -38,6 +48,7 @@
>>>>>>  #include "sysf_exc_scr.h"
>>>>>>  #include "usrbuf.h"
>>>>>>
>>>>>> +static int sysrq_trigger_fd = -1;
>>>>>>
>>>>>>  /*****************************************************************************
>>>>>>
>>>>>> @@ -271,20 +282,88 @@ uint32_t leap_env_destroy()
>>>>>>      return NCSCC_RC_SUCCESS;
>>>>>>  }
>>>>>>
>>>>>> +void opensaf_reboot_prepare(void) {
>>>>>> +    if (sysrq_trigger_fd != -1) return;
>>>>>> +    int fd;
>>>>>> +    do {
>>>>>> +        fd = open("/proc/sysrq-trigger", O_WRONLY);
>>>>>> +    } while (fd == -1 && errno == EINTR);
>>>>>> +    if (fd >= 0 && fd <= 2) {
>>>>>> +        /* We don't want to get file
>>>>>> +         * descriptors 0, 1 or 2 because:
>>>>>> +         * 1) it would be dangerous
>>>>>> +         * 2) it would by closed by deamonize()
>>>>>> +         */
>>>>>> +        opensaf_reboot_prepare();
>>>>>> +        close(fd);
>>>>>> +    } else {
>>>>>> +        sysrq_trigger_fd = fd;
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +static void opensaf_reboot_fallback(int sig_no) {
>>>>>> +    (void) sig_no;
>>>>>> +    if (sysrq_trigger_fd == -1) {
>>>>>> +        do {
>>>>>> +            sysrq_trigger_fd = open("/proc/sysrq-trigger", O_WRONLY);
>>>>>> +        } while (sysrq_trigger_fd == -1 && errno == EINTR);
>>>>>> +    }
>>>>>> +    if (sysrq_trigger_fd != -1) {
>>>>>> +        char buf[] = {'b'};
>>>>>> +        ssize_t result;
>>>>>> +        do {
>>>>>> +            result = write(sysrq_trigger_fd, buf, sizeof(buf));
>>>>>> +        } while (result == -1 && errno == EINTR);
>>>>>> +    }
>>>>>> +    _Exit(EXIT_SUCCESS);
>>>>>> +}
>>>>>> +
>>>>>>  /**
>>>>>>   *
>>>>>>   * @param reason
>>>>>>   */
>>>>>> -void opensaf_reboot(unsigned int node_id, char *ee_name, const char *reason)
>>>>>> +void opensaf_reboot(unsigned node_id, const char* ee_name, const char* reason)
>>>>>>  {
>>>>>> +    char* env_var = getenv("OPENSAF_REBOOT_TIMEOUT");
>>>>>> +    unsigned long supervision_time = 0;
>>>>>> +    if (env_var != NULL) {
>>>>>> +        char* endptr;
>>>>>> +        errno = 0;
>>>>>> +        supervision_time = strtoul(env_var, &endptr, 0);
>>>>>> +        if (errno != 0 || *env_var == '\0' || *endptr != '\0') {
>>>>>> +            supervision_time = 0;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    unsigned own_node_id = ncs_get_node_id();
>>>>>> +    bool use_fallback = supervision_time > 0 && (node_id == 0 || node_id ==
>>>>>> +        own_node_id);
>>>>>> +    if (use_fallback) {
>>>>>> +        if (signal(SIGALRM, opensaf_reboot_fallback) == SIG_ERR) {
>>>>>> +            opensaf_reboot_fallback(0);
>>>>>> +        }
>>>>>> +        alarm(supervision_time);
>>>>>> +    }
>>>>>> +
>>>>>> +    syslog(LOG_CRIT,
>>>>>> +           "Rebooting OpenSAF NodeId = %u EE Name = %s, Reason: %s, "
>>>>>> +           "OwnNodeId = %u, SupervisionTime = %lu",
>>>>>> +           node_id, ee_name == NULL ?
>>>>>> +           "No EE Mapped" : ee_name, reason,
>>>>>> +           own_node_id, supervision_time);
>>>>>>
>>>>>>      char str[256];
>>>>>> -    memset(str,0,256);
>>>>>> +    snprintf(str, sizeof(str), PKGLIBDIR "/opensaf_reboot %u %s", node_id,
>>>>>> +             ee_name == NULL ? "" : ee_name);
>>>>>> +    int reboot_result = system(str);
>>>>>> +    if (reboot_result != EXIT_SUCCESS) {
>>>>>> +        syslog(LOG_CRIT, "node reboot failure: exit code %d",
>>>>>> +               reboot_result);
>>>>>> +    }
>>>>>>
>>>>>> -    snprintf(str,255,PKGLIBDIR"/opensaf_reboot %d %s\n",node_id,((ee_name == NULL)?"":ee_name));
>>>>>> -    syslog(LOG_CRIT,"Rebooting OpenSAF NodeId = %d EE Name = %s, Reason: %s\n",node_id,((ee_name == NULL)? "No EE Mapped":ee_name),reason);
>>>>>> -    if(system(str) == -1){
>>>>>> -        syslog(LOG_CRIT, "node reboot failure!");
>>>>>> +    if (use_fallback) {
>>>>>> +        /* Wait for the alarm signal we set up earlier. */
>>>>>> +        for (;;) pause();
>>>>>>      }
>>>>>>  }
>>>>>>
>>>>>> diff --git a/osaf/services/infrastructure/nid/config/nid.conf b/osaf/services/infrastructure/nid/config/nid.conf
>>>>>> --- a/osaf/services/infrastructure/nid/config/nid.conf
>>>>>> +++ b/osaf/services/infrastructure/nid/config/nid.conf
>>>>>> @@ -23,6 +23,12 @@ OPENSAF_MANAGE_TIPC="yes"
>>>>>>  # Specifies how long "opensafd stop" should wait before stop has considered to fail
>>>>>>  OPENSAF_TERMTIMEOUT=60
>>>>>>
>>>>>> +# Number of seconds before a reboot is escalated to an immediate reboot via the
>>>>>> +# SysRq interface /proc/sysrq-trigger. Comment it out or set it to zero to
>>>>>> +# disable this feature. Note that you must make sure the kernel allows reboot
>>>>>> +# via SysRq for this feature to work.
>>>>>> +export OPENSAF_REBOOT_TIMEOUT=60
>>>>>> +
>>>>>>  # Specify the UNIX group and user OpenSAF run as
>>>>>>  export OPENSAF_GROUP=opensaf
>>>>>>  export OPENSAF_USER=opensaf
>>>>>> diff --git a/osaf/services/saf/avsv/amfwdog/amf_wdog.c b/osaf/services/saf/avsv/amfwdog/amf_wdog.c
>>>>>> --- a/osaf/services/saf/avsv/amfwdog/amf_wdog.c
>>>>>> +++ b/osaf/services/saf/avsv/amfwdog/amf_wdog.c
>>>>>> @@ -137,6 +137,7 @@ int main(int argc, char *argv[])
>>>>>>      SaAmfHealthcheckKeyT hc_key;
>>>>>>      char *hc_key_env;
>>>>>>
>>>>>> +    opensaf_reboot_prepare();
>>>>>>      daemonize(argc, argv);
>>>>>>
>>>>>>      ava_install_amf_down_cb(amf_down_cb);
>>>>>> diff --git a/scripts/opensaf_reboot b/scripts/opensaf_reboot
>>>>>> --- a/scripts/opensaf_reboot
>>>>>> +++ b/scripts/opensaf_reboot
>>>>>> @@ -67,7 +67,15 @@ if [ "$self_node_id" = "$node_id" ] || [
>>>>>>      # uncomment the following line if debugging errors that keep restarting the node
>>>>>>      # exit 0
>>>>>>
>>>>>> -    logger -t "opensaf_reboot" "Rebooting local node"
>>>>>> +    logger -t "opensaf_reboot" "Rebooting local node; timeout=$OPENSAF_REBOOT_TIMEOUT"
>>>>>> +
>>>>>> +    # Start a reboot supervision background process. Note that a similar
>>>>>> +    # supervision is also done in the opensaf_reboot() function in LEAP.
>>>>>> +    # However, that supervision may be stopped by one of the pkill commands
>>>>>> +    # below, if it was called from AMF or FM.
>>>>>> +    if [ "${OPENSAF_REBOOT_TIMEOUT}0" -gt "0" ]; then
>>>>>> +        (sleep "$OPENSAF_REBOOT_TIMEOUT"; echo -n "b" > "/proc/sysrq-trigger") &
>>>>>> +    fi
>>>>>>
>>>>>>      # Stop some important opensaf processes to prevent bad things from happening
>>>>>>      $icmd pkill -STOP osafamfwd
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>>
>>>>>> How ServiceNow helps IT people transform IT departments:
>>>>>> 1.
>>>>>> A cloud service to automate IT design, transition and operations
>>>>>> 2. Dashboards that offer high-level views of enterprise services
>>>>>> 3. A single system of record for all IT processes
>>>>>> http://p.sf.net/sfu/servicenow-d2d-j
>>>>>> _______________________________________________
>>>>>> Opensaf-devel mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
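
A side note on the descriptor question discussed in the thread: the patch avoids descriptors 0, 1 and 2 by calling opensaf_reboot_prepare() again while the low descriptor is still held open. The same guarantee can probably also be had without recursion by using fcntl(F_DUPFD), which returns the lowest free descriptor at or above a given number. The following is only a sketch of that alternative, not what the patch does. The helper name open_above_stdio() is hypothetical and does not exist in OpenSAF, and the assert() is just an illustrative precondition check.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of the patch): open `path` for writing and
 * return a descriptor numbered 3 or higher, or -1 with errno set on failure.
 * Intended to be called before root privileges are dropped. */
static int open_above_stdio(const char *path)
{
	assert(path != NULL);

	int fd;
	do {
		fd = open(path, O_WRONLY);
	} while (fd == -1 && errno == EINTR);

	/* Failure, or already above stdin/stdout/stderr: nothing more to do. */
	if (fd == -1 || fd > STDERR_FILENO)
		return fd;

	/* We got 0, 1 or 2. Duplicate to the lowest free descriptor >= 3,
	 * then release the low one, preserving errno from the fcntl() call. */
	int high_fd = fcntl(fd, F_DUPFD, STDERR_FILENO + 1);
	int saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return high_fd;
}
```

In the context of the patch, the result of open_above_stdio("/proc/sysrq-trigger") would be stored in sysrq_trigger_fd before daemonize() is called. Whether this is preferable to the recursive retry is a matter of taste; the recursion has the advantage of needing no extra system call in the common case.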
