Hi Ramesh,

Today there is no differentiation between a hang before exec (an OpenSAF-related 
problem?) and a hang after exec (an application-related problem?).
The process will time out and will later be killed by the 
give_exec_mod_cb; this is how it works today. With this
patch it will work the same, but we will have a core dump to troubleshoot the 
child part before exec if that part times out. /BR HansN
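
The mechanism described above can be sketched in isolation. This is only a minimal illustration, not the OpenSAF code: the helper name run_guarded_child, the handler name on_alarm, the use of pause() to simulate a hang, and the /bin/true target are all assumptions made for the example. It shows a child that arms an alarm before its pre-exec work, disarms it just before exec, and a parent that can tell from the wait status whether the child was aborted before exec.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Abort on SIGALRM so that a stuck pre-exec phase ends in SIGABRT
 * (and a core dump, if core dumps are enabled on the system). */
static void on_alarm(int sig)
{
	(void)sig;
	abort();
}

/* Hypothetical helper: fork a child that arms a watchdog, optionally
 * "hangs" before exec, then execs /bin/true.
 * Returns the number of the signal that killed the child, 0 if the child
 * exited on its own (with any exit code), or -1 on error. */
static int run_guarded_child(unsigned int tolerance_sec, int hang_before_exec)
{
	int status;
	pid_t pid;

	assert(tolerance_sec > 0); /* alarm(0) would not arm anything */

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		signal(SIGALRM, on_alarm);
		alarm(tolerance_sec); /* arm the pre-exec watchdog */

		if (hang_before_exec) {
			for (;;)
				pause(); /* stand-in for a stuck pre-exec call */
		}

		alarm(0); /* disarm: a pending alarm would survive exec */
		execl("/bin/true", "true", (char *)NULL);
		_exit(127); /* only reached if exec failed */
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling run_guarded_child(1, 1) should report SIGABRT after about one second, while run_guarded_child(1, 0) should report a normal exit. Whether a core file is actually written depends on the system's core dump settings.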

From: Ramesh Betham [mailto:[email protected]]
Sent: den 30 juli 2013 12:45
To: Nagendra Kumar
Cc: Hans Nordebäck; [email protected]; Praveen Malviya; Hans 
Feldt
Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child process 
takes too long time before exec (#514)

It may not be exactly in fork() itself; after fork(), say:

                        if (freopen("/dev/null", "r", stdin) == NULL)
                                syslog(LOG_ERR, "%s: freopen stdin failed - %s", __FUNCTION__, strerror(errno));
                        if (freopen("/dev/null", "w", stdout) == NULL)
                                syslog(LOG_ERR, "%s: freopen stdout failed - %s", __FUNCTION__, strerror(errno));
                        if (freopen("/dev/null", "w", stderr) == NULL)

But if this is the scenario being observed (a hang), let the CLC-CLI time out and 
take subsequent action rather than forcing an abort. What do you say?

Thanks,
Ramesh.
On 7/30/2013 4:04 PM, Nagendra Kumar wrote:

But where do file operations come in during fork? Can you please explain?



Thanks

-Nagu

-----Original Message-----

From: Ramesh Betham

Sent: 30 July 2013 15:53

To: Nagendra Kumar

Cc: Hans Nordebäck; 
[email protected]<mailto:[email protected]>;
 Praveen Malviya; Hans Feldt

Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child process 
takes too long time before exec (#514)



Nagu: Hans N might be pointing to the chance of file-operation calls hanging 
(esp. when some inconsistency happens with NFS). Just a guess; let Hans N 
confirm it.



Thanks,

Ramesh.



On 7/30/2013 3:27 PM, Nagendra Kumar wrote:

Hi,



regarding what can "hang" in the child part, e.g. close of file descriptors.

When can this happen? After fork is successful, this shouldn't happen. Can you 
please provide an example?



Thanks

-Nagu



-----Original Message-----

From: Hans Nordebäck [mailto:[email protected]]

Sent: 30 July 2013 12:53

To: Nagendra Kumar

Cc: Hans Nordebäck; 
[email protected]<mailto:[email protected]>;
 Praveen

Malviya; Ramesh Babu Betham; Hans Feldt

Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child

process takes too long time before exec (#514)



Hi Nagu, regarding what can "hang" in the child part: e.g. close of file 
descriptors. /BR HansN

On 07/30/13 09:01, Hans Nordebäck wrote:

Hi Nagu,



On 07/30/13 08:54, Nagendra Kumar wrote:

Hi Hans N,



1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new environment variable where value is used as input to alarm, if not set it is default 2 seconds.

Do we have some placeholder for this variable for configuration, and are we going to add it to the README for information?

Perhaps the name isn't the best, but it should be handled like the other env variables I guess, e.g. "AVND_PM_MONITORING_RATE", etc.

if the child "hangs" before exec this extra coredump should give information where/what is wrong.

This means that fork hangs, am I right? If yes, then the dump is not going to provide any information: as fork is a system call, the dump can only show that it hangs in fork.

I don't think fork hangs, as the parent part continues and later, with the help of ncs_exec_mod_hdlr, the parent detects that the child or "exec" has timed out, 10 sec in this case. But in this case the exec has not been performed.

After exec, it will work as usual

This confirms that we are only targeting fork to debug.

Yes, the extra core dump will help troubleshooting.

/BR HansN

Thanks

-Nagu



-----Original Message-----

From: Hans Nordebäck [mailto:[email protected]]

Sent: 30 July 2013 11:57

To: Nagendra Kumar

Cc: 
[email protected]<mailto:[email protected]>;
 Praveen Malviya; Ramesh

Babu Betham; Hans Feldt; Hans Nordebäck

Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child

process takes too long time before exec (#514)



Hi Nagu,



1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new environment variable whose value is used as input to alarm; if not set, it defaults to 2 seconds.

2. Yes, you are right; in this particular case it is set to 10 sec, that's why the env. variable above can be set.

3. This alarm is just an additional precaution, at no extra cost, to check the child part before the exec. After exec it will work as usual, but if the child "hangs" before exec this extra core dump should give information about where/what is wrong.



/BR HansN
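
As an illustration of point 1, reading the tolerance from the environment with a 2-second default could be sketched as follows. This is not the code from the patch: the helper name exec_time_tolerance_sec is hypothetical, and the validation rules (base 10 only, rejecting zero, negative, non-numeric, or out-of-range values and falling back to the default) are assumptions chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

#define DEFAULT_TOLERANCE_SEC 2U

/* Hypothetical helper: return the pre-exec tolerance in seconds, taken
 * from OPENSAF_CHILD_EXEC_TIME_TOLERANCE when it holds a valid positive
 * decimal number, otherwise the default of 2 seconds. */
static unsigned int exec_time_tolerance_sec(void)
{
	const char *val = getenv("OPENSAF_CHILD_EXEC_TIME_TOLERANCE");
	char *end;
	long sec;

	if (val == NULL || *val == '\0')
		return DEFAULT_TOLERANCE_SEC;

	errno = 0;
	sec = strtol(val, &end, 10);
	if (errno != 0 || end == val || *end != '\0' ||
	    sec <= 0 || sec > INT_MAX)
		return DEFAULT_TOLERANCE_SEC; /* assumption: fall back on bad input */

	assert(sec > 0 && sec <= INT_MAX);
	return (unsigned int)sec;
}
```

Rejecting zero is a deliberate choice in this sketch, since alarm(0) cancels any pending alarm rather than arming one; a real implementation might instead treat zero as "watchdog disabled".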



-----Original Message-----

From: Nagendra Kumar [mailto:[email protected]]

Sent: den 30 juli 2013 07:11

To: Hans Nordebäck; Praveen Malviya; Hans Feldt; Ramesh Babu Betham

Cc: 
[email protected]<mailto:[email protected]>

Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child

process takes too long time before exec (#514)



Hi Hans N,

For my understanding, can you please provide the below information:

1. I can't find OPENSAF_CHILD_EXEC_TIME_TOLERANCE in opensaf source code.

2. I hope the child process is hung for more than saAmfCtDefClcCliTimeout, resulting in a CLC timeout. Am I right?

3. Even if we add an assert in the child process and get a core dump, it may not give any information, as the child got delayed because of a system issue. Are we targeting which system call the child process is hung in?



Thanks

-Nagu



-----Original Message-----

From: Hans Nordeback [mailto:[email protected]]

Sent: 22 July 2013 17:07

To: Nagendra Kumar; Praveen Malviya; 
[email protected]<mailto:[email protected]>; Ramesh

Babu Betham

Cc: 
[email protected]<mailto:[email protected]>

Subject: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child

process takes too long time before exec (#514)



 osaf/libs/core/leap/os_defs.c |  27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)





amfnd calls ncs_os_process_execute_timed and the child process takes
too long before exec (10 sec timeout). An alarm is set in the
ncs_os_process_execute_timed child process. If it times out, a core dump
is produced to make troubleshooting possible.



diff --git a/osaf/libs/core/leap/os_defs.c b/osaf/libs/core/leap/os_defs.c
--- a/osaf/libs/core/leap/os_defs.c
+++ b/osaf/libs/core/leap/os_defs.c
@@ -65,6 +65,15 @@ bool gl_ncs_atomic_mtx_initialise = fals
    * description of SOCK_CLOEXEC. */
   static pthread_mutex_t s_cloexec_mutex = PTHREAD_MUTEX_INITIALIZER;
 
+/*
+ * ALRM signal is used to detect if child process takes too long time before exec.
+ *
+ * @param sig
+ */
+static void sigalrm_handler(int sig) {
+    abort();
+}
 /***************************************************************************
    *
    * uns64
@@ -999,6 +1008,22 @@ uint32_t ncs_os_process_execute_timed(NC
       osaf_mutex_lock_ordie(&s_cloexec_mutex);
         if ((pid = fork()) == 0) {
+                unsigned int alarm_time_sec;
+                char* alarm_time;
+
+                if (signal(SIGALRM, sigalrm_handler) == SIG_ERR) {
+                        LOG_ER("signal ALRM failed: %s", strerror(errno));
+                }
+                if ((alarm_time = getenv("OPENSAF_CHILD_EXEC_TIME_TOLERANCE")) != NULL) {
+                        alarm_time_sec = strtol(alarm_time, NULL, 0);
+                }
+                else {
+                        // default alarm timeout 2 seconds
+                        alarm_time_sec = 2;
+                }
+
+                alarm(alarm_time_sec);
+
           /*
            ** Make sure forked processes have default scheduling class
            ** independent of the callers scheduling class.
@@ -1054,6 +1079,8 @@ uint32_t ncs_os_process_execute_timed(NC
           }
   #endif
 
+                alarm(0);
+
           /* child part */
           if (execvp(req->i_script, req->i_argv) == -1) {
               syslog(LOG_ERR, "%s: execvp '%s' failed - %s", __FUNCTION__, req->i_script, strerror(errno));



