Looks like we are in the need of a new non-blocking logging API. Maybe a solution that can also be used to solve the MDS logging issue (where it writes to a while while holding a lock).
regards, Anders Widell 2013-08-09 09:03, Hans Nordebäck skrev: > Hi, we got feedback from one site with the alarm patch applied (#532). > In this case it hangs when calling syslog: > > #0 0x00007f03b0b9db35 in raise () from /lib64/libc.so.6 > #0 0x00007f03b0b9db35 in raise () from /lib64/libc.so.6 > #1 0x00007f03b0b9f111 in abort () from /lib64/libc.so.6 > #2 0x00007f03b1e1dd99 in sigalrm_handler (sig=5928) at os_defs.c:79 > #3 <signal handler called> > #4 0x00007f03b0c51dac in __lll_lock_wait_private () from > /lib64/libc.so.6 > #5 0x00007f03b0c04e0d in _L_lock_1598 () from /lib64/libc.so.6 > #6 0x00007f03b0c04bc6 in __tz_convert () from /lib64/libc.so.6 > #7 0x00007f03b0c40f6a in __vsyslog_chk () from /lib64/libc.so.6 > #8 0x00007f03b0c415b0 in syslog () from /lib64/libc.so.6 > #9 0x00007f03b1e1e992 in ncs_os_process_execute_timed > (req=0x7fff10644ef0) at os_defs.c:1086 > #10 0x0000000000416d03 in avnd_comp_clc_cmd_execute (cb=0x6578c0, > comp=0x67f680, cmd_type=AVND_COMP_CLC_CMD_TYPE_INSTANTIATE) at > avnd_clc.c:2702 > #11 0x000000000041a810 in avnd_comp_clc_uninst_inst_hdler > (cb=0x6578c0, comp=0x67f680) at avnd_clc.c:1384 > #12 0x00000000004187fb in avnd_comp_clc_fsm_run (cb=0x6578c0, > comp=0x67f680, ev=AVND_COMP_CLC_PRES_FSM_EV_INST) at avnd_clc.c:874 > #13 0x000000000043438f in avnd_su_pres_insting_compinst_hdler > (cb=0x6578c0, su=0x676380, comp=0x680750) at avnd_susm.c:1759 > #14 0x0000000000435bc4 in avnd_su_pres_fsm_run (cb=0x6578c0, > su=0x676380, comp=0x680750, ev=AVND_SU_PRES_FSM_EV_COMP_INSTANTIATED) > at avnd_susm.c:1293 > #15 0x00000000004175cb in avnd_comp_clc_st_chng_prc (cb=0x6578c0, > comp=0x680750, prv_st=SA_AMF_PRESENCE_INSTANTIATING, > final_st=SA_AMF_PRESENCE_INSTANTIATED) at avnd_clc.c:1313 > #16 0x000000000041888f in avnd_comp_clc_fsm_run (cb=0x6578c0, > comp=0x680750, ev=AVND_COMP_CLC_PRES_FSM_EV_INST_SUCC) at avnd_clc.c:897 > #17 0x000000000041adb8 in avnd_evt_clc_resp_evh (cb=0x6578c0, > evt=0x67d0e0) at avnd_clc.c:446 > #18 0x00000000004300e0 in avnd_evt_process (evt=<optimized out>) at > avnd_proc.c:278 > #19 avnd_main_process () at avnd_proc.c:219 > #20 0x0000000000408815 in main (argc=1, argv=0x7fff10645988) at > amfnd_main.c:61 > > > /HansN > > On 08/06/13 09:34, praveen malviya wrote: >> Hi, >> I added a new syslog entry(to debug) before execv and was >> able to reproduce the problem. >> The reason for the time out seems to be that syslog was *hung*, but >> *after* writing the new log message into /var/log/messages. >> So, there is a high possibility that the child process gets hung in >> syslog or other file operations. >> See bt below: >> Thread 1 (Thread 0x7f030a16f700 (LWP 16331)): >> >> #0 0x00007f0308de4b1e in __lll_lock_wait_private () from >> /lib64/libc.so.6 >> #1 0x00007f0308dd4398 in _L_lock_634 () from /lib64/libc.so.6 >> #2 0x00007f0308dd3e62 in __vsyslog_chk () from /lib64/libc.so.6 >> #3 0x00007f0308dd4320 in syslog () from /lib64/libc.so.6 >> #4 0x00007f0309d38feb in ncs_os_process_execute_timed >> (req=0x7fffdcc52770, flag=false) at os_defs.c:1059 >> #5 0x0000000000416bd6 in avnd_comp_clc_cmd_execute (cb=0x6578a0, >> comp=0x673840, cmd_type=AVND_COMP_CLC_CMD_TYPE_INSTANTIATE) >> at avnd_clc.c:2687 >> #6 0x000000000041a630 in avnd_comp_clc_uninst_inst_hdler >> (cb=0x6578a0, comp=0x673840) at avnd_clc.c:1384 >> #7 0x000000000041865b in avnd_comp_clc_fsm_run (cb=0x6578a0, >> comp=0x673840, ev=AVND_COMP_CLC_PRES_FSM_EV_INST) >> at avnd_clc.c:874 >> #8 0x0000000000433f5f in avnd_su_pres_insting_compinst_hdler >> (cb=0x6578a0, su=0x66fc70, comp=0x674e10) at avnd_susm.c:1759 >> #9 0x0000000000435764 in avnd_su_pres_fsm_run (cb=0x6578a0, >> su=0x66fc70, comp=0x674e10, >> ev=AVND_SU_PRES_FSM_EV_COMP_INSTANTIATED) at avnd_susm.c:1293 >> #10 0x000000000041748b in avnd_comp_clc_st_chng_prc (cb=0x6578a0, >> comp=0x674e10, prv_st=SA_AMF_PRESENCE_INSTANTIATING, >> final_st=SA_AMF_PRESENCE_INSTANTIATED) at avnd_clc.c:1313 >> #11 0x00000000004186ef in avnd_comp_clc_fsm_run (cb=0x6578a0, >> comp=0x674e10, ev=AVND_COMP_CLC_PRES_FSM_EV_INST_SUCC) >> at avnd_clc.c:897 >> #12 0x000000000041f400 in avnd_comp_reg_prc (cb=0x6578a0, >> comp=0x674e10, pxy_comp=<optimized out>, reg=0x677528, >> dest=0x677518) at avnd_comp.c:709 >> #13 0x000000000041fbd7 in avnd_evt_ava_comp_reg_evh (cb=0x6578a0, >> evt=0x683530) at avnd_comp.c:212 >> #14 0x000000000042fcc8 in avnd_evt_process (evt=<optimized out>) at >> avnd_proc.c:278 >> #15 avnd_main_process () at avnd_proc.c:219 >> #16 0x0000000000408835 in main (argc=2, argv=0x7fffdcc53588) at >> amfnd_main.c:53 >> >> >> Thanks, >> Praveen. >> >> >> On 03-Aug-13 9:47 PM, Hans Nordebäck wrote: >>> Hi, >>> >>> I think the problem with "child before exec" "hangs" and the problem >>> where "child" at cleanup times out instantly, even though a 10 sec. >>> timeout is set, are related. >>> >>> Some reasons why the child seems to hang could be: >>> >>> 1. Not run at all >>> 2. "Hangs" in e.g. close. >>> 3. Crashes, but then whe should have core dumps. >>> >>> but the problem with "hanging" and instantly times out could be: >>> >>> 4. When parent process adds the pid in the patricia tree, >>> add_new_req_pid_in_list, ncs_exec_mod_hdlr >>> is able to kill this child pid before it has started. >>> A e.g. race condtion in this code could explain both these faults. >>> I looked briefly in this code and one thing is the delivery of the >>> SIGCHLD signal, any thread can get that signal, >>> e.g amfnd, mds, timer or ncs_exec_mod_hdlr. I think a dedicated >>> thread should handle SIGCHLD. >>> I guess we should look closer to this part. >>> >>> /BR HansN >>> ________________________________________ >>> Från: Hans Nordebäck >>> Skickat: den 30 juli 2013 13:05 >>> Till: Nagendra Kumar >>> Cc: Ramesh Babu Betham; [email protected]; Praveen >>> Malviya; Hans Feldt >>> Ämne: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child >>> process takes too long time before exec (#514) >>> >>> Hi Nagu, I have tested the patch and simulated "hanging" in the >>> child part, the core dump produced points out where. >>> The system works as before, exept the core dump if it hangs in the >>> child part. We have this patch running at a site >>> waiting for it to trigger, so we don't know yet what is causing this >>> and if the patch will help. >>> >>> /BR HansN >>> >>> -----Original Message----- >>> From: Nagendra Kumar [mailto:[email protected]] >>> Sent: den 30 juli 2013 13:05 >>> To: Hans Nordebäck >>> Cc: Ramesh Babu Betham; [email protected]; Praveen >>> Malviya; Hans Feldt >>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child >>> process takes too long time before exec (#514) >>> >>> Hi Hans N, >>> >>> Close is not a sync call. I am still not convinced, these calls may >>> get blocked because of NFS system issue. >>> Have you tested by keeping abort in child process and see how amfnd >>> or other services behaves(just to simulate abort)? >>> >>> Thanks >>> -Nagu >>> >>> -----Original Message----- >>> From: Hans Nordebäck [mailto:[email protected]] >>> Sent: 30 July 2013 16:16 >>> To: Nagendra Kumar >>> Cc: Ramesh Babu Betham; [email protected]; Praveen >>> Malviya; Hans Feldt >>> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child >>> process takes too long time before exec (#514) >>> >>> Hi Nagu, I agree with Ramesh, e.g. in the child part: >>> >>> /* By default we close all inherited file descriptors in the >>> child */ >>> if (getenv("OPENSAF_KEEP_FD_OPEN_AFTER_FORK") == NULL) { >>> /* Close all inherited file descriptors */ >>> int i = sysconf(_SC_OPEN_MAX); >>> if (i == -1) { >>> syslog(LOG_ERR, "%s: sysconf failed - %s", >>> __FUNCTION__, strerror(errno)); >>> exit(EXIT_FAILURE); >>> } >>> for (i--; i >= 0; --i) >>> (void) close(i); /* close all descriptors */ >>> >>> /* Redirect standard files to /dev/null */ >>> if (freopen("/dev/null", "r", stdin) == NULL) >>> syslog(LOG_ERR, "%s: freopen stdin failed - %s", >>> __FUNCTION__, strerror(errno)); >>> if (freopen("/dev/null", "w", stdout) == NULL) >>> syslog(LOG_ERR, "%s: freopen stdout failed - %s", >>> __FUNCTION__, strerror(errno)); >>> if (freopen("/dev/null", "w", stderr) == NULL) >>> syslog(LOG_ERR, "%s: freopen stderr failed - %s", >>> __FUNCTION__, strerror(errno)); >>> } >>> >>> close are issued. >>> >>> /BR HansN >>> >>> On 07/30/13 12:34, Nagendra Kumar wrote: >>>> But where are file operations coming during fork, can you please >>>> explain ? >>>> >>>> Thanks >>>> -Nagu >>>> -----Original Message----- >>>> From: Ramesh Betham >>>> Sent: 30 July 2013 15:53 >>>> To: Nagendra Kumar >>>> Cc: Hans Nordebäck; [email protected]; Praveen >>>> Malviya; Hans Feldt >>>> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child >>>> process takes too long time before exec (#514) >>>> >>>> Nagu: Hans N might be pointing to the chances of hung of >>>> file-operation calls (esp. when some inconsistency happens with >>>> NFS). Just a guess, let Hans N confirm it. >>>> >>>> Thanks, >>>> Ramesh. >>>> >>>> On 7/30/2013 3:27 PM, Nagendra Kumar wrote: >>>>> Hi, >>>>> >>>>>>> regarding what can "hang" in the child part, e.g close of file >>>>>>> descriptors close of file descriptors. >>>>> When this can happen? After fork is successful, this shouldn't >>>>> happen. Can you please provide any example. >>>>> >>>>> Thanks >>>>> -Nagu >>>>> >>>>> -----Original Message----- >>>>> From: Hans Nordebäck [mailto:[email protected]] >>>>> Sent: 30 July 2013 12:53 >>>>> To: Nagendra Kumar >>>>> Cc: Hans Nordebäck; [email protected]; Praveen >>>>> Malviya; Ramesh Babu Betham; Hans Feldt >>>>> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child >>>>> process takes too long time before exec (#514) >>>>> >>>>> Hi Nagu, regarding what can "hang" in the child part, e.g close of >>>>> file descriptors. /BR HansN On 07/30/13 09:01, Hans Nordebäck wrote: >>>>>> Hi Nagu, >>>>>> >>>>>> On 07/30/13 08:54, Nagendra Kumar wrote: >>>>>>> Hi Hans N, >>>>>>> >>>>>>>>> 1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new >>>>>>>>> environment variable where value is used as input to alarm, if >>>>>>>>> not set it is default 2 seconds. >>>>>>> Do we have some place holder for this variable for configuration >>>>>>> and are we going to add it in README for information. >>>>>> Perhaps the name isn't the best, but it should be handled as the >>>>>> other env variable I guess, e.g. "AVND_PM_MONITORING_RATE", etc. >>>>>>>>> if the child "hangs" before exec this extra coredump should give >>>>>>>>> information where/what is wrong. >>>>>>> This means that fork hangs, am I right ? If yes, then dump is not >>>>>>> going to provide any information as it is a system call, it can >>>>>>> only show, ithangs in fork. >>>>>> I don't think fork hangs as the parent part continues and later, >>>>>> with the help of ncs_exec_mod_hdlr, the parent detects that the >>>>>> child or "exec" has timed out, >>>>>> 10 sec in this case. But in this case the exec has not been >>>>>> performed. >>>>>>>>> After exec, it will work as usual >>>>>>> This confirms that we are only targeting fork to debug. >>>>>> Yes, the extra core dump will help troubleshooting. >>>>>> /BR HansN >>>>>>> Thanks >>>>>>> -Nagu >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Hans Nordebäck [mailto:[email protected]] >>>>>>> Sent: 30 July 2013 11:57 >>>>>>> To: Nagendra Kumar >>>>>>> Cc: [email protected]; Praveen Malviya; Ramesh >>>>>>> Babu Betham; Hans Feldt; Hans Nordebäck >>>>>>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed >>>>>>> child process takes too long time before exec (#514) >>>>>>> >>>>>>> Hi Nagu, >>>>>>> >>>>>>> 1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new >>>>>>> environment variable where value is used as input to alarm, if not >>>>>>> set it is default 2 seconds. >>>>>>> 2. Yes you are right, in this particular case it is set to 10 sec, >>>>>>> that's why the env. variable above can be set. >>>>>>> 3. This alarm is just an additional precaution, at no extra cost, >>>>>>> to check the child part before the exec. After exec >>>>>>> it will work as usual but if the child "hangs" before exec >>>>>>> this extra coredump should give information where/what is wrong. >>>>>>> >>>>>>> /BR HansN >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Nagendra Kumar [mailto:[email protected]] >>>>>>> Sent: den 30 juli 2013 07:11 >>>>>>> To: Hans Nordebäck; Praveen Malviya; Hans Feldt; Ramesh Babu Betham >>>>>>> Cc: [email protected] >>>>>>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed >>>>>>> child process takes too long time before exec (#514) >>>>>>> >>>>>>> Hi Hans N, >>>>>>> For my understanding, can you please provide the below >>>>>>> information: >>>>>>> >>>>>>> 1. I can't find OPENSAF_CHILD_EXEC_TIME_TOLERANCE in opensaf >>>>>>> source code. >>>>>>> 2. I hope the child process is hung for more than >>>>>>> saAmfCtDefClcCliTimeout resulting in CLC time out. Am I right? >>>>>>> 3. Even we add assert in child process and we get core dump, but >>>>>>> it may not give any information as it got delayed because of >>>>>>> system issue. Are we targeting, which system call the child >>>>>>> process is hung? >>>>>>> >>>>>>> Thanks >>>>>>> -Nagu >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Hans Nordeback [mailto:[email protected]] >>>>>>> Sent: 22 July 2013 17:07 >>>>>>> To: Nagendra Kumar; Praveen Malviya; [email protected]; >>>>>>> Ramesh Babu Betham >>>>>>> Cc: [email protected] >>>>>>> Subject: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child >>>>>>> process takes too long time before exec (#514) >>>>>>> >>>>>>> osaf/libs/core/leap/os_defs.c | 27 >>>>>>> +++++++++++++++++++++++++++ >>>>>>> 1 files changed, 27 insertions(+), 0 deletions(-) >>>>>>> >>>>>>> >>>>>>> amfnd calls ncs_os_process_execute_timed and the child process >>>>>>> takes too long time before exec, (10 sec timeout). An alarm is set >>>>>>> in the ncs_os_process_execute_timed child process. If timed out a >>>>>>> core dump will be produced to be able to trouble shoot. >>>>>>> >>>>>>> diff --git a/osaf/libs/core/leap/os_defs.c >>>>>>> b/osaf/libs/core/leap/os_defs.c >>>>>>> --- a/osaf/libs/core/leap/os_defs.c >>>>>>> +++ b/osaf/libs/core/leap/os_defs.c >>>>>>> @@ -65,6 +65,15 @@ bool gl_ncs_atomic_mtx_initialise = fals >>>>>>> * description of SOCK_CLOEXEC. */ >>>>>>> static pthread_mutex_t s_cloexec_mutex = >>>>>>> PTHREAD_MUTEX_INITIALIZER; >>>>>>> +/* >>>>>>> + * ALRM signal is used to detect if child process takes too long >>>>>>> time before exec. >>>>>>> + * >>>>>>> + * @param sig >>>>>>> + */ >>>>>>> +static void sigalrm_handler(int sig) { >>>>>>> + abort(); >>>>>>> +} >>>>>>> /*************************************************************************** >>>>>>> >>>>>>> >>>>>>> * >>>>>>> * uns64 >>>>>>> @@ -999,6 +1008,22 @@ uint32_t ncs_os_process_execute_timed(NC >>>>>>> osaf_mutex_lock_ordie(&s_cloexec_mutex); >>>>>>> if ((pid = fork()) == 0) { >>>>>>> + unsigned int alarm_time_sec; >>>>>>> + char* alarm_time; >>>>>>> + >>>>>>> + if (signal(SIGALRM, sigalrm_handler) == SIG_ERR) { >>>>>>> + LOG_ER("signal ALRM failed: %s", >>>>>>> strerror(errno)); >>>>>>> + } >>>>>>> + if ((alarm_time = >>>>>>> getenv("OPENSAF_CHILD_EXEC_TIME_TOLERANCE")) != NULL) { >>>>>>> + alarm_time_sec = strtol(alarm_time, >>>>>>> NULL, 0); >>>>>>> + } >>>>>>> + else { >>>>>>> + // default alarm timeout 2 seconds >>>>>>> + alarm_time_sec = 2; >>>>>>> + } >>>>>>> + >>>>>>> + alarm(alarm_time_sec); >>>>>>> + >>>>>>> /* >>>>>>> ** Make sure forked processes have default >>>>>>> scheduling class >>>>>>> ** independent of the callers scheduling class. >>>>>>> @@ -1054,6 +1079,8 @@ uint32_t ncs_os_process_execute_timed(NC >>>>>>> } >>>>>>> #endif >>>>>>> + alarm(0); >>>>>>> + >>>>>>> /* child part */ >>>>>>> if (execvp(req->i_script, req->i_argv) == -1) { >>>>>>> syslog(LOG_ERR, "%s: execvp '%s' failed - %s", >>>>>>> __FUNCTION__, req->i_script, strerror(errno)); >> > ------------------------------------------------------------------------------ Get 100% visibility into Java/.NET code with AppDynamics Lite! It's a free troubleshooting tool designed for production. Get down to code-level detail for bottlenecks, with <2% overhead. Download for free and get started troubleshooting in minutes. http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk _______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
