You are right; I have given up on that. As an alternative, what do you think of moving opensaf_reboot_prepare() inside daemonize() and calling it only for programs whose basename matches "fmd", "amfwd" or "clmd"?
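To make the proposal concrete, here is a minimal sketch of what such a basename check could look like. The helper name needs_reboot_prepare() is hypothetical and not part of the patch, and the three names are taken literally from the paragraph above; the installed executables may be named differently (the opensaf_reboot script refers to "osafamfwd", for example), so the list would have to be checked against the real binary names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in the patch): returns true if the program named
 * by argv0 is one of the daemons that should open /proc/sysrq-trigger via
 * opensaf_reboot_prepare() before root privileges are dropped.
 * Assumption: the basenames are exactly those listed in the proposal. */
static bool needs_reboot_prepare(const char *argv0)
{
	static const char *const names[] = { "fmd", "amfwd", "clmd" };

	assert(argv0 != NULL);

	/* Take the part after the last '/' without modifying argv0. */
	const char *slash = strrchr(argv0, '/');
	const char *base = (slash != NULL) ? slash + 1 : argv0;

	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (strcmp(base, names[i]) == 0)
			return true;
	}
	return false;
}
```

Inside daemonize() this would presumably be used as "if (needs_reboot_prepare(argv[0])) opensaf_reboot_prepare();", placed before the point where privileges are dropped and before file descriptors 0, 1 and 2 are closed.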
Thanks,
Mathi.

> -----Original Message-----
> From: Anders Widell [mailto:[email protected]]
> Sent: Friday, June 14, 2013 5:17 PM
> To: Mathivanan Naickan Palanivelu
> Cc: Hans Feldt; [email protected]
> Subject: Re: [devel] [PATCH 1 of 1] osaf: Add time supervision of
> opensaf_reboot [#437]
>
> The problem with using capabilities is that we would need to use either libcap
> or libcap-ng, neither of which is included in LSB.
>
> regards,
> Anders Widell
>
> On 2013-06-13 17:39, Mathivanan Naickan Palanivelu wrote:
> > I'm playing with CAP_KILL, CAP_SYS_BOOT and PR_SET_KEEPCAP.
> > Will get back on this patch tomorrow.
> >
> > Cheers.
> >
> >> -----Original Message-----
> >> From: Mathivanan Naickan Palanivelu
> >> Sent: Thursday, June 13, 2013 8:53 PM
> >> To: Anders Widell; Hans Feldt
> >> Cc: [email protected]
> >> Subject: RE: [devel] [PATCH 1 of 1] osaf: Add time supervision of
> >> opensaf_reboot [#437]
> >>
> >> I was also looking at CAP_SYS_ADMIN(alternative for
> >> opensaf_reboot_prepare()), as an option until I ran into
> >> https://lwn.net/Articles/486306/!
> >> CAP_SYS_ADMIN would make us vulnerable.
> >>
> >> Cheers,
> >> Mathi.
> >>
> >>
> >>> -----Original Message-----
> >>> From: Anders Widell [mailto:[email protected]]
> >>> Sent: Tuesday, June 11, 2013 1:56 PM
> >>> To: Hans Feldt
> >>> Cc: [email protected]
> >>> Subject: Re: [devel] [PATCH 1 of 1] osaf: Add time supervision of
> >>> opensaf_reboot [#437]
> >>>
> >>> Maybe I should also point out that the case of getting file
> >>> descriptors 0, 1 or 2 is not just a hypothetical scenario that I
> >>> have dreamed up - it actually happens. The code will not work without
> >>> the retry.
> >>>
> >>> I will try to make the code comments more clear, maybe mention the
> >>> daemonize() function instead of just referring to "dropping root
> >>> privileges".
> >>>
> >>> regards,
> >>> Anders Widell
> >>>
> >>> On 2013-06-10 17:04, Anders Widell wrote:
> >>>> See comments below.
> >>>>
> >>>> regards,
> >>>> Anders Widell
> >>>>
> >>>> On 2013-06-10 15:47, Hans Feldt wrote:
> >>>>> Why is not opensaf_reboot_prepare() called from all contexts?
> >>>> What do you mean by all contexts? It is called by amfwd since it
> >>>> needs to reboot the local node without running as root. As I said
> >>>> in the review mail (but maybe also should go into the commit
> >>>> message), amfd and fmd can simply _Exit() to reboot the local node.
> >>>> This can be a separate enhancement ticket, since it works already
> >>>> now (opensaf_reboot() will exit when the timer has expired).
> >>>>> I think the implementation of opensaf_reboot_prepare() requires
> >>>>> some comments since it does recursion. I think I understand it but
> >>>>> it is just a little to clever to be uncommented...
> >>>> I did put a comment just at the point of recursive call, but maybe
> >>>> it wasn't clear enough? :-) Basically, I don't want to get file
> >>>> descriptors 0, 1 or 2. So if I do get one of those I try again.
> >>>>> Why is opensaf_reboot_prepare() called before daemonize()? I guess
> >>>>> that should be commented since it is probably important.
> >>>> The comment for opensaf_reboot_prepare() says that it must be
> >>>> called before dropping root privileges. daemonize() is the function
> >>>> that drops root privileges, so I think it is fairly clear why it is
> >>>> called before daemonize().
> >>>>> Thanks,
> >>>>> Hans
> >>>>>
> >>>>>
> >>>>> On 06/10/2013 12:50 PM, Anders Widell wrote:
> >>>>>>  00-README.conf                                   |   5 +
> >>>>>>  osaf/libs/core/include/ncssysf_def.h             |  20 ++++-
> >>>>>>  osaf/libs/core/leap/sysf_def.c                   |  93 ++++++++++++++++++++++-
> >>>>>>  osaf/services/infrastructure/nid/config/nid.conf |   6 +
> >>>>>>  osaf/services/saf/avsv/amfwdog/amf_wdog.c        |   1 +
> >>>>>>  scripts/opensaf_reboot                           |  10 ++-
> >>>>>>  6 files changed, 126 insertions(+), 9 deletions(-)
> >>>>>>
> >>>>>>
> >>>>>> Add a time supervision of the library function opensaf_reboot()
> >>>>>> as well as the shell script opensaf_reboot. If the reboot has not
> >>>>>> happened before the timeout, the OS is rebooted hard using the
> >>>>>> SysRq trigger /proc/sysrq-trigger. This makes it possible to
> >>>>>> reboot the node also when the system is in a very bad state, for
> >>>>>> example when fork() fails because the system is out of resources
> >>>>>> (no free memory, process table full etc.). It also handles the
> >>>>>> case when the ordinary reboot command hangs trying to sync the
> >>>>>> file system, for example due to a disk or NFS problem.
> >>>>>>
> >>>>>> diff --git a/00-README.conf b/00-README.conf
> >>>>>> --- a/00-README.conf
> >>>>>> +++ b/00-README.conf
> >>>>>> @@ -52,6 +52,11 @@ group/user.
> >>>>>>
> >>>>>>  - Use of MDS subslot ID needs to be enabled, add TIPC_USE_SUBSLOT_ID=YES
> >>>>>>
> >>>>>> +- Time supervision of local node reboot should be disabled or changed. Change
> >>>>>> +  OPENSAF_REBOOT_TIMEOUT to the desired number of seconds before a reboot is
> >>>>>> +  escalated to an immediate reboot via the SysRq interface, or zero to disable
> >>>>>> +  this feature.
> >>>>>> +
> >>>>>>  *******************************************************************************
> >>>>>>  nodeinit.conf
> >>>>>>
> >>>>>> diff --git a/osaf/libs/core/include/ncssysf_def.h b/osaf/libs/core/include/ncssysf_def.h
> >>>>>> --- a/osaf/libs/core/include/ncssysf_def.h
> >>>>>> +++ b/osaf/libs/core/include/ncssysf_def.h
> >>>>>> @@ -83,7 +83,25 @@ extern "C" {
> >>>>>>  #define m_START_CRITICAL m_NCS_OS_START_TASK_LOCK
> >>>>>>  #define m_END_CRITICAL m_NCS_OS_END_TASK_LOCK
> >>>>>>
> >>>>>> -extern void opensaf_reboot(unsigned int node_id, char *ee_name, const char *reason);
> >>>>>> +/**
> >>>>>> + * Prepare for a future call to opensaf_reboot() by opening the necessary
> >>>>>> + * file (/proc/sysrq-trigger). Call this function before dropping root
> >>>>>> + * privileges, if you later intend to call opensaf_reboot() to reboot the local
> >>>>>> + * node without having root privileges.
> >>>>>> + */
> >>>>>> +void opensaf_reboot_prepare(void);
> >>>>>> +
> >>>>>> +/**
> >>>>>> + * Reboot a node. Call this function with @a node_id zero to reboot the local
> >>>>>> + * node. If you intend to use this function to reboot the local node without
> >>>>>> + * having root privileges, you must first call opensaf_reboot_prepare() before
> >>>>>> + * dropping root privileges.
> >>>>>> + *
> >>>>>> + * Note that this function uses the configuration option OPENSAF_REBOOT_TIMEOUT
> >>>>>> + * in nid.conf. Therefore, this function must only be called from services
> >>>>>> + * that are started by NID.
> >>>>>> + */
> >>>>>> +void opensaf_reboot(unsigned node_id, const char* ee_name, const char* reason);
> >>>>>>
> >>>>>>  /*****************************************************************************
> >>>>>>  ** **
> >>>>>> diff --git a/osaf/libs/core/leap/sysf_def.c b/osaf/libs/core/leap/sysf_def.c
> >>>>>> --- a/osaf/libs/core/leap/sysf_def.c
> >>>>>> +++ b/osaf/libs/core/leap/sysf_def.c
> >>>>>> @@ -26,7 +26,17 @@
> >>>>>>
> >>>>>>  #include <configmake.h>
> >>>>>>
> >>>>>> -#include <ncsgl_defs.h>
> >>>>>> +#include <stdio.h>
> >>>>>> +#include <errno.h>
> >>>>>> +#include <stdlib.h>
> >>>>>> +#include <stdbool.h>
> >>>>>> +#include <sys/stat.h>
> >>>>>> +#include <fcntl.h>
> >>>>>> +#include <unistd.h>
> >>>>>> +#include <signal.h>
> >>>>>> +#include <syslog.h>
> >>>>>> +#include "ncs_main_papi.h"
> >>>>>> +#include "ncsgl_defs.h"
> >>>>>>  #include "ncs_osprm.h"
> >>>>>>
> >>>>>>  #include "ncs_svd.h"
> >>>>>> @@ -38,6 +48,7 @@
> >>>>>>  #include "sysf_exc_scr.h"
> >>>>>>  #include "usrbuf.h"
> >>>>>>
> >>>>>> +static int sysrq_trigger_fd = -1;
> >>>>>>
> >>>>>>  /*****************************************************************************
> >>>>>>
> >>>>>> @@ -271,20 +282,88 @@ uint32_t leap_env_destroy()
> >>>>>>      return NCSCC_RC_SUCCESS;
> >>>>>>  }
> >>>>>>
> >>>>>> +void opensaf_reboot_prepare(void) {
> >>>>>> +    if (sysrq_trigger_fd != -1) return;
> >>>>>> +    int fd;
> >>>>>> +    do {
> >>>>>> +        fd = open("/proc/sysrq-trigger", O_WRONLY);
> >>>>>> +    } while (fd == -1 && errno == EINTR);
> >>>>>> +    if (fd >= 0 && fd <= 2) {
> >>>>>> +        /* We don't want to get file descriptors 0, 1 or 2 because:
> >>>>>> +         * 1) it would be dangerous
> >>>>>> +         * 2) it would by closed by deamonize()
> >>>>>> +         */
> >>>>>> +        opensaf_reboot_prepare();
> >>>>>> +        close(fd);
> >>>>>> +    } else {
> >>>>>> +        sysrq_trigger_fd = fd;
> >>>>>> +    }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void opensaf_reboot_fallback(int sig_no) {
> >>>>>> +    (void) sig_no;
> >>>>>> +    if (sysrq_trigger_fd == -1) {
> >>>>>> +        do {
> >>>>>> +            sysrq_trigger_fd = open("/proc/sysrq-trigger", O_WRONLY);
> >>>>>> +        } while (sysrq_trigger_fd == -1 && errno == EINTR);
> >>>>>> +    }
> >>>>>> +    if (sysrq_trigger_fd != -1) {
> >>>>>> +        char buf[] = {'b'};
> >>>>>> +        ssize_t result;
> >>>>>> +        do {
> >>>>>> +            result = write(sysrq_trigger_fd, buf, sizeof(buf));
> >>>>>> +        } while (result == -1 && errno == EINTR);
> >>>>>> +    }
> >>>>>> +    _Exit(EXIT_SUCCESS);
> >>>>>> +}
> >>>>>> +
> >>>>>>  /**
> >>>>>>   *
> >>>>>>   * @param reason
> >>>>>>   */
> >>>>>> -void opensaf_reboot(unsigned int node_id, char *ee_name, const char *reason)
> >>>>>> +void opensaf_reboot(unsigned node_id, const char* ee_name, const char* reason)
> >>>>>>  {
> >>>>>> +    char* env_var = getenv("OPENSAF_REBOOT_TIMEOUT");
> >>>>>> +    unsigned long supervision_time = 0;
> >>>>>> +    if (env_var != NULL) {
> >>>>>> +        char* endptr;
> >>>>>> +        errno = 0;
> >>>>>> +        supervision_time = strtoul(env_var, &endptr, 0);
> >>>>>> +        if (errno != 0 || *env_var == '\0' || *endptr != '\0') {
> >>>>>> +            supervision_time = 0;
> >>>>>> +        }
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    unsigned own_node_id = ncs_get_node_id();
> >>>>>> +    bool use_fallback = supervision_time > 0 && (node_id == 0 || node_id ==
> >>>>>> +        own_node_id);
> >>>>>> +    if (use_fallback) {
> >>>>>> +        if (signal(SIGALRM, opensaf_reboot_fallback) == SIG_ERR) {
> >>>>>> +            opensaf_reboot_fallback(0);
> >>>>>> +        }
> >>>>>> +        alarm(supervision_time);
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    syslog(LOG_CRIT,
> >>>>>> +        "Rebooting OpenSAF NodeId = %u EE Name = %s, Reason: %s, "
> >>>>>> +        "OwnNodeId = %u, SupervisionTime = %lu",
> >>>>>> +        node_id, ee_name == NULL ? "No EE Mapped" : ee_name, reason,
> >>>>>> +        own_node_id, supervision_time);
> >>>>>>
> >>>>>>      char str[256];
> >>>>>> -    memset(str,0,256);
> >>>>>> +    snprintf(str, sizeof(str), PKGLIBDIR "/opensaf_reboot %u %s", node_id,
> >>>>>> +        ee_name == NULL ? "" : ee_name);
> >>>>>> +    int reboot_result = system(str);
> >>>>>> +    if (reboot_result != EXIT_SUCCESS) {
> >>>>>> +        syslog(LOG_CRIT, "node reboot failure: exit code %d", reboot_result);
> >>>>>> +    }
> >>>>>>
> >>>>>> -    snprintf(str,255,PKGLIBDIR"/opensaf_reboot %d %s\n",node_id,((ee_name == NULL)?"":ee_name));
> >>>>>> -    syslog(LOG_CRIT,"Rebooting OpenSAF NodeId = %d EE Name = %s, Reason: %s\n",node_id,((ee_name == NULL)? "No EE Mapped":ee_name),reason);
> >>>>>> -    if(system(str) == -1){
> >>>>>> -        syslog(LOG_CRIT, "node reboot failure!");
> >>>>>> +    if (use_fallback) {
> >>>>>> +        /* Wait for the alarm signal we set up earlier. */
> >>>>>> +        for (;;) pause();
> >>>>>>      }
> >>>>>>  }
> >>>>>>
> >>>>>> diff --git a/osaf/services/infrastructure/nid/config/nid.conf b/osaf/services/infrastructure/nid/config/nid.conf
> >>>>>> --- a/osaf/services/infrastructure/nid/config/nid.conf
> >>>>>> +++ b/osaf/services/infrastructure/nid/config/nid.conf
> >>>>>> @@ -23,6 +23,12 @@ OPENSAF_MANAGE_TIPC="yes"
> >>>>>>  # Specifies how long "opensafd stop" should wait before stop has considered to fail
> >>>>>>  OPENSAF_TERMTIMEOUT=60
> >>>>>>
> >>>>>> +# Number of seconds before a reboot is escalated to an immediate reboot via the
> >>>>>> +# SysRq interface /proc/sysrq-trigger. Comment it out or set it to zero to
> >>>>>> +# disable this feature. Note that you must make sure the kernel allows reboot
> >>>>>> +# via SysRq for this feature to work.
> >>>>>> +export OPENSAF_REBOOT_TIMEOUT=60
> >>>>>> +
> >>>>>>  # Specify the UNIX group and user OpenSAF run as
> >>>>>>  export OPENSAF_GROUP=opensaf
> >>>>>>  export OPENSAF_USER=opensaf
> >>>>>> diff --git a/osaf/services/saf/avsv/amfwdog/amf_wdog.c b/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> >>>>>> --- a/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> >>>>>> +++ b/osaf/services/saf/avsv/amfwdog/amf_wdog.c
> >>>>>> @@ -137,6 +137,7 @@ int main(int argc, char *argv[])
> >>>>>>      SaAmfHealthcheckKeyT hc_key;
> >>>>>>      char *hc_key_env;
> >>>>>>
> >>>>>> +    opensaf_reboot_prepare();
> >>>>>>      daemonize(argc, argv);
> >>>>>>
> >>>>>>      ava_install_amf_down_cb(amf_down_cb);
> >>>>>> diff --git a/scripts/opensaf_reboot b/scripts/opensaf_reboot
> >>>>>> --- a/scripts/opensaf_reboot
> >>>>>> +++ b/scripts/opensaf_reboot
> >>>>>> @@ -67,7 +67,15 @@ if [ "$self_node_id" = "$node_id" ] || [
> >>>>>>      # uncomment the following line if debugging errors that keep restarting the node
> >>>>>>      # exit 0
> >>>>>>
> >>>>>> -    logger -t "opensaf_reboot" "Rebooting local node"
> >>>>>> +    logger -t "opensaf_reboot" "Rebooting local node; timeout=$OPENSAF_REBOOT_TIMEOUT"
> >>>>>> +
> >>>>>> +    # Start a reboot supervision background process. Note that a similar
> >>>>>> +    # supervision is also done in the opensaf_reboot() function in LEAP.
> >>>>>> +    # However, that supervision may be stopped by one of the pkill commands
> >>>>>> +    # below, if it was called from AMF or FM.
> >>>>>> +    if [ "${OPENSAF_REBOOT_TIMEOUT}0" -gt "0" ]; then
> >>>>>> +        (sleep "$OPENSAF_REBOOT_TIMEOUT"; echo -n "b" > "/proc/sysrq-trigger") &
> >>>>>> +    fi
> >>>>>>
> >>>>>>      # Stop some important opensaf processes to prevent bad things from happening
> >>>>>>      $icmd pkill -STOP osafamfwd
> >>>>>>
> >>>>>> ------------------------------------------------------------------------------
> >>>>>> How ServiceNow helps IT people transform IT departments:
> >>>>>> 1. A cloud service to automate IT design, transition and operations
> >>>>>> 2. Dashboards that offer high-level views of enterprise services
> >>>>>> 3. A single system of record for all IT processes
> >>>>>> http://p.sf.net/sfu/servicenow-d2d-j
> >>>>>> _______________________________________________
> >>>>>> Opensaf-devel mailing list
> >>>>>> [email protected]
> >>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
