My final comments (preceded by [Ramesh]) :-).
On 8/7/2013 2:21 PM, Hans Nordebäck wrote:
> Hi, sorry I forgot it, I add my comments below, (preceeded by HansN):
>
>> "
>> My 2 cents..
>> 1. Instead of LOG_ER() better call sigalrm_handler(0).
> HansN: I think it is good to have sigalrm active, but if signal fails it
> is not fatal and a log message should be enough, and then continue with exec
> instead of aborting.
>
>> + if (signal(SIGALRM, sigalrm_handler) == SIG_ERR) {
>> + LOG_ER("signal ALRM failed: %s", strerror(errno));
>> + }
[Ramesh]: Intention behind this comment is, as the patch is trying to
handle the case were "syslog" or any "fd" write blocks.. then generate a
core-dump. So having "LOG_ER", the significance of this patch is lost as
it may even block at LOG_ER.
>> 2. Why cannot the timeout logic can derive/bind directly from/to the timeout
>> value ("i_timeout_in_ms" of NCS_OS_PROC_EXECUTE_TIMED_INFO structure).
>> Say..,
>> Example: alarm_time_sec = (i_timeout_in_ms/2000); /* say
>> (half or lesser) of CLC-CLI timeout value in seconds. */
> HansN: I guess the time to run the child part before exec should be short
> and the timeout value should determined by "osaf" and not the configuration
> of an application, that also may vary from very short to very long
> timeout values.
>
>> 3. I prefer the kind of patch should go into a major release (4.4) not for
>> minor releases (4.3.x/4.2.x).
> HansN: ok, but we have several faults reported related to this, so it would
> be good to apply the patch.
[Ramesh]: Can push it, but this patch requires more testing.
> Thanks,
> Ramesh.
> "
>
> /BR HansN
>
> -----Original Message-----
> From: Ramesh Betham [mailto:[email protected]]
> Sent: den 7 augusti 2013 10:34
> To: Hans Nordebäck
> Cc: Nagendra Kumar; [email protected]; Hans Feldt; Anders
> Widell
> Subject: Re: SV: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child
> process takes too long time before exec (#514)
>
> Can you please provide your opinion on my initial comments (mail-attached).
>
> Thanks,
> Ramesh.
>
> On 8/7/2013 1:43 PM, Hans Nordebäck wrote:
>> Hi, what is the conclusion, can this patch be acked? /BR HansN
>>
>> -----Original Message-----
>> From: Hans Nordebäck
>> Sent: den 6 augusti 2013 16:41
>> To: praveen malviya
>> Cc: Nagendra Kumar; Ramesh Babu Betham;
>> [email protected]; Hans Feldt; Anders Widell; Hans
>> Nordebäck
>> Subject: RE: SV: [PATCH 1 of 1] leap: ncs_os_process_execute_timed
>> child process takes too long time before exec (#514)
>>
>> the fault described below is related to ticket #532. /HansN
>>
>> -----Original Message-----
>> From: Hans Nordebäck
>> Sent: den 6 augusti 2013 16:28
>> To: praveen malviya
>> Cc: Nagendra Kumar; Ramesh Babu Betham;
>> [email protected]; Hans Feldt; Anders Widell; Hans
>> Nordebäck
>> Subject: RE: SV: [PATCH 1 of 1] leap: ncs_os_process_execute_timed
>> child process takes too long time before exec (#514)
>>
>> Hi, Yes, it seems most likely that the child is "hanging" somewhere before
>> exec, if it had crashed it should be a sigchld sent, which not seems to be
>> the case here.
>> Then the alarm patch needs to be applied I guess to detect where it is
>> "hanging".
>>
>> /BR HansN
>>
>> -----Original Message-----
>> From: praveen malviya [mailto:[email protected]]
>> Sent: den 6 augusti 2013 09:34
>> To: Hans Nordebäck
>> Cc: Nagendra Kumar; Ramesh Babu Betham;
>> [email protected]; Hans Feldt; Anders Widell
>> Subject: Re: SV: [PATCH 1 of 1] leap: ncs_os_process_execute_timed
>> child process takes too long time before exec (#514)
>>
>> Hi,
>> I added a new syslog entry(to debug) before execv and was able to
>> reproduce the problem.
>> The reason for the time out seems to be that syslog was *hung*, but
>> *after* writing the new log message into /var/log/messages.
>> So, there is a high possibility that the child process gets hung in syslog
>> or other file operations.
>> See bt below:
>> Thread 1 (Thread 0x7f030a16f700 (LWP 16331)):
>>
>> #0 0x00007f0308de4b1e in __lll_lock_wait_private () from
>> /lib64/libc.so.6
>> #1 0x00007f0308dd4398 in _L_lock_634 () from /lib64/libc.so.6
>> #2 0x00007f0308dd3e62 in __vsyslog_chk () from /lib64/libc.so.6
>> #3 0x00007f0308dd4320 in syslog () from /lib64/libc.so.6
>> #4 0x00007f0309d38feb in ncs_os_process_execute_timed
>> (req=0x7fffdcc52770, flag=false) at os_defs.c:1059
>> #5 0x0000000000416bd6 in avnd_comp_clc_cmd_execute (cb=0x6578a0,
>> comp=0x673840, cmd_type=AVND_COMP_CLC_CMD_TYPE_INSTANTIATE)
>> at avnd_clc.c:2687
>> #6 0x000000000041a630 in avnd_comp_clc_uninst_inst_hdler
>> (cb=0x6578a0,
>> comp=0x673840) at avnd_clc.c:1384
>> #7 0x000000000041865b in avnd_comp_clc_fsm_run (cb=0x6578a0, comp=0x673840,
>> ev=AVND_COMP_CLC_PRES_FSM_EV_INST)
>> at avnd_clc.c:874
>> #8 0x0000000000433f5f in avnd_su_pres_insting_compinst_hdler
>> (cb=0x6578a0, su=0x66fc70, comp=0x674e10) at avnd_susm.c:1759
>> #9 0x0000000000435764 in avnd_su_pres_fsm_run (cb=0x6578a0, su=0x66fc70,
>> comp=0x674e10,
>> ev=AVND_SU_PRES_FSM_EV_COMP_INSTANTIATED) at avnd_susm.c:1293
>> #10 0x000000000041748b in avnd_comp_clc_st_chng_prc (cb=0x6578a0,
>> comp=0x674e10, prv_st=SA_AMF_PRESENCE_INSTANTIATING,
>> final_st=SA_AMF_PRESENCE_INSTANTIATED) at avnd_clc.c:1313
>> #11 0x00000000004186ef in avnd_comp_clc_fsm_run (cb=0x6578a0, comp=0x674e10,
>> ev=AVND_COMP_CLC_PRES_FSM_EV_INST_SUCC)
>> at avnd_clc.c:897
>> #12 0x000000000041f400 in avnd_comp_reg_prc (cb=0x6578a0, comp=0x674e10,
>> pxy_comp=<optimized out>, reg=0x677528,
>> dest=0x677518) at avnd_comp.c:709
>> #13 0x000000000041fbd7 in avnd_evt_ava_comp_reg_evh (cb=0x6578a0,
>> evt=0x683530) at avnd_comp.c:212
>> #14 0x000000000042fcc8 in avnd_evt_process (evt=<optimized out>) at
>> avnd_proc.c:278
>> #15 avnd_main_process () at avnd_proc.c:219
>> #16 0x0000000000408835 in main (argc=2, argv=0x7fffdcc53588) at
>> amfnd_main.c:53
>>
>>
>> Thanks,
>> Praveen.
>>
>>
>> On 03-Aug-13 9:47 PM, Hans Nordebäck wrote:
>>> Hi,
>>>
>>> I think the problem with "child before exec" "hangs" and the problem where
>>> "child" at cleanup times out instantly, even though a 10 sec. timeout is
>>> set, are related.
>>>
>>> Some reasons why the child seems to hang could be:
>>>
>>> 1. Not run at all
>>> 2. "Hangs" in e.g. close.
>>> 3. Crashes, but then whe should have core dumps.
>>>
>>> but the problem with "hanging" and instantly times out could be:
>>>
>>> 4. When parent process adds the pid in the patricia tree,
>>> add_new_req_pid_in_list, ncs_exec_mod_hdlr is able to kill this child pid
>>> before it has started.
>>> A e.g. race condtion in this code could explain both these faults.
>>> I looked briefly in this code and one thing is the delivery of the
>>> SIGCHLD signal, any thread can get that signal, e.g amfnd, mds, timer or
>>> ncs_exec_mod_hdlr. I think a dedicated thread should handle SIGCHLD.
>>> I guess we should look closer to this part.
>>>
>>> /BR HansN
>>> ________________________________________
>>> Från: Hans Nordebäck
>>> Skickat: den 30 juli 2013 13:05
>>> Till: Nagendra Kumar
>>> Cc: Ramesh Babu Betham; [email protected]; Praveen
>>> Malviya; Hans Feldt
>>> Ämne: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child
>>> process takes too long time before exec (#514)
>>>
>>> Hi Nagu, I have tested the patch and simulated "hanging" in the child part,
>>> the core dump produced points out where.
>>> The system works as before, exept the core dump if it hangs in the
>>> child part. We have this patch running at a site waiting for it to trigger,
>>> so we don't know yet what is causing this and if the patch will help.
>>>
>>> /BR HansN
>>>
>>> -----Original Message-----
>>> From: Nagendra Kumar [mailto:[email protected]]
>>> Sent: den 30 juli 2013 13:05
>>> To: Hans Nordebäck
>>> Cc: Ramesh Babu Betham; [email protected]; Praveen
>>> Malviya; Hans Feldt
>>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child
>>> process takes too long time before exec (#514)
>>>
>>> Hi Hans N,
>>>
>>> Close is not a sync call. I am still not convinced, these calls may get
>>> blocked because of NFS system issue.
>>> Have you tested by keeping abort in child process and see how amfnd or
>>> other services behaves(just to simulate abort)?
>>>
>>> Thanks
>>> -Nagu
>>>
>>> -----Original Message-----
>>> From: Hans Nordebäck [mailto:[email protected]]
>>> Sent: 30 July 2013 16:16
>>> To: Nagendra Kumar
>>> Cc: Ramesh Babu Betham; [email protected]; Praveen
>>> Malviya; Hans Feldt
>>> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child
>>> process takes too long time before exec (#514)
>>>
>>> Hi Nagu, I agree with Ramesh, e.g. in the child part:
>>>
>>> /* By default we close all inherited file descriptors in the child
>>> */
>>> if (getenv("OPENSAF_KEEP_FD_OPEN_AFTER_FORK") == NULL) {
>>> /* Close all inherited file descriptors */
>>> int i = sysconf(_SC_OPEN_MAX);
>>> if (i == -1) {
>>> syslog(LOG_ERR, "%s: sysconf failed - %s",
>>> __FUNCTION__, strerror(errno));
>>> exit(EXIT_FAILURE);
>>> }
>>> for (i--; i >= 0; --i)
>>> (void) close(i); /* close all descriptors */
>>>
>>> /* Redirect standard files to /dev/null */
>>> if (freopen("/dev/null", "r", stdin) == NULL)
>>> syslog(LOG_ERR, "%s: freopen stdin failed - %s",
>>> __FUNCTION__, strerror(errno));
>>> if (freopen("/dev/null", "w", stdout) == NULL)
>>> syslog(LOG_ERR, "%s: freopen stdout failed - %s",
>>> __FUNCTION__, strerror(errno));
>>> if (freopen("/dev/null", "w", stderr) == NULL)
>>> syslog(LOG_ERR, "%s: freopen stderr failed - %s",
>>> __FUNCTION__, strerror(errno));
>>> }
>>>
>>> close are issued.
>>>
>>> /BR HansN
>>>
>>> On 07/30/13 12:34, Nagendra Kumar wrote:
>>>> But where are file operations coming during fork, can you please explain ?
>>>>
>>>> Thanks
>>>> -Nagu
>>>> -----Original Message-----
>>>> From: Ramesh Betham
>>>> Sent: 30 July 2013 15:53
>>>> To: Nagendra Kumar
>>>> Cc: Hans Nordebäck; [email protected]; Praveen
>>>> Malviya; Hans Feldt
>>>> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child
>>>> process takes too long time before exec (#514)
>>>>
>>>> Nagu: Hans N might be pointing to the chances of hung of file-operation
>>>> calls (esp. when some inconsistency happens with NFS). Just a guess, let
>>>> Hans N confirm it.
>>>>
>>>> Thanks,
>>>> Ramesh.
>>>>
>>>> On 7/30/2013 3:27 PM, Nagendra Kumar wrote:
>>>>> Hi,
>>>>>
>>>>>>> regarding what can "hang" in the child part, e.g close of file
>>>>>>> descriptors close of file descriptors.
>>>>> When this can happen? After fork is successful, this shouldn't happen.
>>>>> Can you please provide any example.
>>>>>
>>>>> Thanks
>>>>> -Nagu
>>>>>
>>>>> -----Original Message-----
>>>>> From: Hans Nordebäck [mailto:[email protected]]
>>>>> Sent: 30 July 2013 12:53
>>>>> To: Nagendra Kumar
>>>>> Cc: Hans Nordebäck; [email protected]; Praveen
>>>>> Malviya; Ramesh Babu Betham; Hans Feldt
>>>>> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed
>>>>> child process takes too long time before exec (#514)
>>>>>
>>>>> Hi Nagu, regarding what can "hang" in the child part, e.g close of file
>>>>> descriptors. /BR HansN On 07/30/13 09:01, Hans Nordebäck wrote:
>>>>>> Hi Nagu,
>>>>>>
>>>>>> On 07/30/13 08:54, Nagendra Kumar wrote:
>>>>>>> Hi Hans N,
>>>>>>>
>>>>>>>>> 1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new
>>>>>>>>> environment variable where value is used as input to alarm, if
>>>>>>>>> not set it is default 2 seconds.
>>>>>>> Do we have some place holder for this variable for configuration
>>>>>>> and are we going to add it in README for information.
>>>>>> Perhaps the name isn't the best, but it should be handled as the
>>>>>> other env variable I guess, e.g. "AVND_PM_MONITORING_RATE", etc.
>>>>>>>>> if the child "hangs" before exec this extra coredump should
>>>>>>>>> give information where/what is wrong.
>>>>>>> This means that fork hangs, am I right ? If yes, then dump is not
>>>>>>> going to provide any information as it is a system call, it can
>>>>>>> only show, ithangs in fork.
>>>>>> I don't think fork hangs as the parent part continues and later,
>>>>>> with the help of ncs_exec_mod_hdlr, the parent detects that the
>>>>>> child or "exec" has timed out,
>>>>>> 10 sec in this case. But in this case the exec has not been performed.
>>>>>>>>> After exec, it will work as usual
>>>>>>> This confirms that we are only targeting fork to debug.
>>>>>> Yes, the extra core dump will help troubleshooting.
>>>>>> /BR HansN
>>>>>>> Thanks
>>>>>>> -Nagu
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Hans Nordebäck [mailto:[email protected]]
>>>>>>> Sent: 30 July 2013 11:57
>>>>>>> To: Nagendra Kumar
>>>>>>> Cc: [email protected]; Praveen Malviya; Ramesh
>>>>>>> Babu Betham; Hans Feldt; Hans Nordebäck
>>>>>>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed
>>>>>>> child process takes too long time before exec (#514)
>>>>>>>
>>>>>>> Hi Nagu,
>>>>>>>
>>>>>>> 1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new
>>>>>>> environment variable where value is used as input to alarm, if
>>>>>>> not set it is default 2 seconds.
>>>>>>> 2. Yes you are right, in this particular case it is set to 10
>>>>>>> sec, that's why the env. variable above can be set.
>>>>>>> 3. This alarm is just an additional precaution, at no extra cost,
>>>>>>> to check the child part before the exec. After exec
>>>>>>> it will work as usual but if the child "hangs" before
>>>>>>> exec this extra coredump should give information where/what is wrong.
>>>>>>>
>>>>>>> /BR HansN
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Nagendra Kumar [mailto:[email protected]]
>>>>>>> Sent: den 30 juli 2013 07:11
>>>>>>> To: Hans Nordebäck; Praveen Malviya; Hans Feldt; Ramesh Babu
>>>>>>> Betham
>>>>>>> Cc: [email protected]
>>>>>>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed
>>>>>>> child process takes too long time before exec (#514)
>>>>>>>
>>>>>>> Hi Hans N,
>>>>>>> For my understanding, can you please provide the
>>>>>>> below
>>>>>>> information:
>>>>>>>
>>>>>>> 1. I can't find OPENSAF_CHILD_EXEC_TIME_TOLERANCE in opensaf
>>>>>>> source code.
>>>>>>> 2. I hope the child process is hung for more than
>>>>>>> saAmfCtDefClcCliTimeout resulting in CLC time out. Am I right?
>>>>>>> 3. Even we add assert in child process and we get core dump, but
>>>>>>> it may not give any information as it got delayed because of
>>>>>>> system issue. Are we targeting, which system call the
>>>>>>> child process is hung?
>>>>>>>
>>>>>>> Thanks
>>>>>>> -Nagu
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Hans Nordeback [mailto:[email protected]]
>>>>>>> Sent: 22 July 2013 17:07
>>>>>>> To: Nagendra Kumar; Praveen Malviya; [email protected];
>>>>>>> Ramesh Babu Betham
>>>>>>> Cc: [email protected]
>>>>>>> Subject: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child
>>>>>>> process takes too long time before exec (#514)
>>>>>>>
>>>>>>> osaf/libs/core/leap/os_defs.c | 27 +++++++++++++++++++++++++++
>>>>>>> 1 files changed, 27 insertions(+), 0 deletions(-)
>>>>>>>
>>>>>>>
>>>>>>> amfnd calls ncs_os_process_execute_timed and the child process
>>>>>>> takes too long time before exec, (10 sec timeout). An alarm is
>>>>>>> set in the ncs_os_process_execute_timed child process. If timed
>>>>>>> out a core dump will be produced to be able to trouble shoot.
>>>>>>>
>>>>>>> diff --git a/osaf/libs/core/leap/os_defs.c
>>>>>>> b/osaf/libs/core/leap/os_defs.c
>>>>>>> --- a/osaf/libs/core/leap/os_defs.c
>>>>>>> +++ b/osaf/libs/core/leap/os_defs.c
>>>>>>> @@ -65,6 +65,15 @@ bool gl_ncs_atomic_mtx_initialise = fals
>>>>>>> * description of SOCK_CLOEXEC. */
>>>>>>> static pthread_mutex_t s_cloexec_mutex =
>>>>>>> PTHREAD_MUTEX_INITIALIZER;
>>>>>>> +/*
>>>>>>> + * ALRM signal is used to detect if child process takes too long
>>>>>>> time before exec.
>>>>>>> + *
>>>>>>> + * @param sig
>>>>>>> + */
>>>>>>> +static void sigalrm_handler(int sig) {
>>>>>>> + abort();
>>>>>>> +}
>>>>>>> /***************************************************************************
>>>>>>> *
>>>>>>> * uns64
>>>>>>> @@ -999,6 +1008,22 @@ uint32_t ncs_os_process_execute_timed(NC
>>>>>>> osaf_mutex_lock_ordie(&s_cloexec_mutex);
>>>>>>> if ((pid = fork()) == 0) {
>>>>>>> + unsigned int alarm_time_sec;
>>>>>>> + char* alarm_time;
>>>>>>> +
>>>>>>> + if (signal(SIGALRM, sigalrm_handler) == SIG_ERR) {
>>>>>>> + LOG_ER("signal ALRM failed: %s",
>>>>>>> strerror(errno));
>>>>>>> + }
>>>>>>> + if ((alarm_time =
>>>>>>> getenv("OPENSAF_CHILD_EXEC_TIME_TOLERANCE")) != NULL) {
>>>>>>> + alarm_time_sec = strtol(alarm_time, NULL, 0);
>>>>>>> + }
>>>>>>> + else {
>>>>>>> + // default alarm timeout 2 seconds
>>>>>>> + alarm_time_sec = 2;
>>>>>>> + }
>>>>>>> +
>>>>>>> + alarm(alarm_time_sec);
>>>>>>> +
>>>>>>> /*
>>>>>>> ** Make sure forked processes have default scheduling
>>>>>>> class
>>>>>>> ** independent of the callers scheduling class.
>>>>>>> @@ -1054,6 +1079,8 @@ uint32_t ncs_os_process_execute_timed(NC
>>>>>>> }
>>>>>>> #endif
>>>>>>> + alarm(0);
>>>>>>> +
>>>>>>> /* child part */
>>>>>>> if (execvp(req->i_script, req->i_argv) == -1) {
>>>>>>> syslog(LOG_ERR, "%s: execvp '%s' failed - %s",
>>>>>>> __FUNCTION__, req->i_script, strerror(errno));
------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite!
It's a free troubleshooting tool designed for production.
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel