Subject: Re: Slow fork bomb message in latest version of POE

albertocurro Mon, 24 Mar 2014 09:17:48 -0700

Hi,

 Sorry, but I don't quite follow what you mean by "not using sig_child() as 
intended". Do you mean calling it from the main session so that each child 
process is cleaned up properly?
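
 To check my understanding, here is a minimal sketch of what I think the 
intended usage looks like. The forked child and the 'child_exited' event name 
are placeholders I made up for illustration, not code from our product:

```perl
use strict;
use warnings;
use POSIX ();
use POE;

# Sketch only: the forked child and the 'child_exited' event name are
# placeholders, not code from the real product.
our @reaped;    # PIDs reported by POE, kept so the result can be inspected

POE::Session->create(
    inline_states => {
        _start => sub {
            my ($kernel, $heap) = @_[KERNEL, HEAP];

            my $pid = fork();
            die "fork failed: $!" unless defined $pid;

            if ($pid == 0) {
                # Child: stand-in for the real work (e.g. exec'ing nginx).
                sleep 1;
                POSIX::_exit(0);
            }

            # Parent: register a watcher for this specific PID.
            $heap->{child_pid} = $pid;
            $kernel->sig_child($pid, 'child_exited');
        },
        child_exited => sub {
            # For CHLD events, ARG1 is the PID and ARG2 is the value of $?.
            my ($heap, $pid, $status) = @_[HEAP, ARG1, ARG2];
            push @reaped, $pid;
            printf "child %d exited with status %d\n", $pid, $status >> 8;
            delete $heap->{child_pid};
        },
    },
);

POE::Kernel->run();
```

 If I read the documentation correctly, the watcher registered by sig_child() 
keeps the session alive until the child has been reaped, so run() should not 
return while the child is still running.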


 The issue I have is how to handle unexpected exceptions. They seem to 
propagate uncaught, killing POE's kernel before I get a chance to step in. I 
could be thinking about the timing in the wrong way, though...
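
 What I would like to end up with is something along these lines. This is only 
a sketch: it assumes POE delivers a DIE signal to the session whose handler 
died, and the 'write_config' and 'on_exception' event names are invented for 
the example:

```perl
use strict;
use warnings;
use POE;

# Sketch only: 'write_config' and 'on_exception' are hypothetical event
# names, and the die() message stands in for the real failure.
our @caught;    # error strings seen by the DIE handler

POE::Session->create(
    inline_states => {
        _start => sub {
            my $kernel = $_[KERNEL];
            $kernel->sig(DIE => 'on_exception');
            $kernel->yield('write_config');
        },
        write_config => sub {
            # Stand-in for the failing initialization step.
            die "Could not open file: No such file or directory\n";
        },
        on_exception => sub {
            # ARG1 is a hash reference describing the exception.
            my ($kernel, $ex) = @_[KERNEL, ARG1];
            push @caught, $ex->{error_str};
            warn "caught in '$ex->{event}': $ex->{error_str}";

            # Cleanup would go here, e.g. signalling any child that is
            # being watched with sig_child() so it can exit and be reaped.

            # Mark the exception as handled so it is not rethrown, then
            # drop the watcher so the session is free to stop.
            $kernel->sig_handled();
            $kernel->sig('DIE');
        },
    },
);

POE::Kernel->run();
```

 The idea is that the exception is dealt with while the kernel is still 
running, instead of at the point where run() is already about to return.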

 Alberto

---- On Mon, 24 Mar 2014 16:59:49 +0100 Rocco Caputo <[email protected]> 
wrote ---- 

 > You are not using sig_child() as intended.  When used as intended, 
 > sig_child() will prevent shutdown until the child process has exited and has 
 > been reaped.  The timing issues you're worried about should not exist. 
 >  
 > --  
 > Rocco Caputo <[email protected]> 
 >  
 > On Mar 24, 2014, at 11:44, albertocurro <[email protected]> wrote: 
 >  
 > > Hi Rocco, 
 > >  
 > > many thanks for your quick answer! Unfortunately, the provided solution 
 > > only works partially. There are still some cases where the "fork bomb" 
 > > message shows up :( 
 > >  
 > >  One of the cases is this: under some configurations, an instance of 
 > > nginx is started, so our product writes the configuration file and starts 
 > > the nginx instance pointing to that file. BUT, if the 
 > > configuration file cannot be written (directory does not exist, etc.), 
 > > then an error is raised, and I have not found any way to handle it: 
 > >  
 > > DEBUG - Created nginx temporary directory /opt/tmp/pull/instance1 
 > > DEBUG - Created nginx configuration directory /opt/etc/pull/instance1 
 > > DEBUG - Created nginx log directory /opt/log/pull/instance1 
 > > DEBUG - creating nginx configfile for instance 1 in 
 > > /opt/etc/pull/instance1 
 > > === 13991 === !!! Kernel has 1 child process(es). 
 > > === 13991 === !!! At least one child process is still running when 
 > > POE::Kernel->run() is ready to return. 
 > > === 13991 === !!! Be sure to use sig_child() to reap child processes. 
 > > === 13991 === !!! In extreme cases, failure to reap child processes has 
 > > === 13991 === !!! resulted in a slow 'fork bomb' that has halted systems. 
 > > Could not open file: No such file or directory 
 > >  
 > > I've added a DIE handler in the main session to try to handle this: 
 > >  
 > > $sig_session = POE::Session->create( 
 > >    inline_states => { 
 > >        _start => sub { 
 > >            $_[HEAP]{RELOADED} = 0; 
 > >            $_[KERNEL]->sig(TERM => '_sigterm'); 
 > >            $_[KERNEL]->sig(INT => '_sigterm'); 
 > >            $_[KERNEL]->sig(DIE => '_sigdie'); 
 > >            $_[KERNEL]->sig(nginx_reload => '_sig_nginx_reload'); 
 > >            $_[KERNEL]->alias_set('sighandler'); 
 > >        }, 
 > >        _sigdie => sub { 
 > >            print "Handling exception, calling stop"; 
 > >            POE::Kernel->call($sig_session, '_stop'); 
 > >        }, 
 > >        _stop => sub { 
 > >            # Reap any existing pid (# 1825119) 
 > >            print "Handling stop"; 
 > >            POE::Kernel->sig_child(); 
 > >            use POSIX ":sys_wait_h"; 
 > >            1 while waitpid(-1, WNOHANG) > 0; 
 > >  
 > >            # Clear signal handlers... 
 > >            $_[KERNEL]->sig('TERM'); 
 > >  
 > > But, as I said above, it's not working. Looking at POE's code, I can see 
 > > that the message lines are generated in Resources/Signals.pm, in the 
 > > _data_sig_finalize() method (where POE already does what you 
 > > recommended to me: waiting for the PID). 
 > >  
 > > But the _data_sig_finalize() method is called in Kernel.pm just after 
 > > all the signals have been unregistered (Kernel.pm => _finalize_kernel): 
 > >  
 > > my $self = shift; 
 > >  
 > >  # Disable signal watching since there's now no place for them to go. 
 > >  foreach ($self->_data_sig_get_safe_signals()) { 
 > >    $self->loop_ignore_signal($_); 
 > >  } 
 > >  
 > >  # Remove the kernel session's signal watcher. 
 > >  $self->_data_sig_remove($self->ID, "IDLE"); 
 > >  
 > >  # The main loop is done, no matter which event library ran it. 
 > >  # sig before loop so that it clears the signal_pipe file handler 
 > >  $self->_data_sig_finalize(); 
 > >  $self->loop_finalize(); 
 > >  
 > > By that point, none of the signal handlers in my main session can 
 > > fire, because the signals have already been unregistered. So how can I 
 > > handle an exception (die) raised while POE::Kernel->run() is running? 
 > >  
 > > Thanks a lot 
 > > Alberto 
 > >  
 > >  
 > >  
 > >  
 > > ---- On Mon, 24 Mar 2014 13:45:45 +0100 Rocco Caputo wrote ----  
 > >  
 > >> Hi, Alberto.  
 > >>  
 > >> At program end time, POE runs a quick waitpid() check for child processes 
 > >> that may have leaked. This check was added after a bug report where POE 
 > >> locked up a server after several days of running. It turned out to be the 
 > >> reporter's application, but it was hard to debug.  
 > >>  
 > >> Your program seems to have created two processes that it didn't reap: 
 > >> PIDs 5373 and 5374. The ideal solution is to reap those processes before 
 > >> exiting. Your program can do this using POE::Kernel's sig_child() method. 
 > >>  
 > >>  
 > >> In some cases, a third-party library will create processes and not 
 > >> properly clean them up. It can be impossible to solve this case without 
 > >> modifying other people's code.  
 > >>  
 > >> If you just want to ignore the problem, this might do the trick. Put 
 > >> these lines in your last _stop handler. They should reap the processes 
 > >> you've leaked before POE's check:  
 > >>  
 > >> use POSIX ":sys_wait_h";  
 > >> 1 while waitpid(-1, WNOHANG) > 0;  
 > >>  
 > >> It's a bit of a pain, but I think it's better to explicitly ignore the 
 > >> problem than for it to go unnoticed by default.  
 > >>  
 > >> Please let me know whether that resolves your problem. It may not. For 
 > >> example, the processes may still be open until an object is destroyed at 
 > >> global destruction time.  
 > >>  
 > >> --  
 > >> Rocco Caputo   
 > >>  
 > >> On Mar 24, 2014, at 05:46, albertocurro  wrote:  
 > >>  
 > >>> Guys,  
 > >>>  
 > >>> We have a product built on POE as its base framework, together with some 
 > >>> other utility libraries such as log4perl. It is basically a forward 
 > >>> proxy composed of several modules, each of which comprises a 
 > >>> POE::Session; all of them share an internal queue of tasks to be 
 > >>> performed. Each module performs several tasks on initialization, and if 
 > >>> anything goes wrong, croak() is called to stop the service. This is 
 > >>> considered OK, since croak() is only called during initialization, when 
 > >>> validation is being performed.  
 > >>>  
 > >>> The product is stable and works really well, but I recently updated POE 
 > >>> to the latest version, and since then we see this message in the 
 > >>> logs:  
 > >>>  
 > >>> registering pdu failed: 263!  
 > >>> === 5267 === 5 -> on_handle (from Handler/StoreRemote.pm at 87)  
 > >>> === 5267 === 5 -> on_retry (from Handler/StoreRemote.pm at 141)  
 > >>> === 5267 === 9 -> on_handle (from Handler/StoreRemote.pm at 87)  
 > >>> === 5267 === 9 -> on_retry (from Handler/StoreRemote.pm at 141)  
 > >>> === 5267 === !!! Kernel has child processes.  
 > >>> === 5267 === !!! Stopped child process (PID 5373) reaped when 
 > >>> POE::Kernel->run() is ready to return.  
 > >>> === 5267 === !!! Stopped child process (PID 5374) reaped when 
 > >>> POE::Kernel->run() is ready to return.  
 > >>> === 5267 === !!! At least one child process is still running when 
 > >>> POE::Kernel->run() is ready to return.  
 > >>> === 5267 === !!! Be sure to use sig_child() to reap child processes.  
 > >>> === 5267 === !!! In extreme cases, failure to reap child processes has  
 > >>> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems. 
 > >>>  
 > >>> mkdir /mnt/nfs99: Permission denied at Handler/Store.pm line 147  
 > >>>  
 > >>> The first lines and the last line above are the errors themselves, but 
 > >>> this part is new since the upgrade:  
 > >>>  
 > >>> === 5267 === !!! Kernel has child processes.  
 > >>> === 5267 === !!! Stopped child process (PID 5373) reaped when 
 > >>> POE::Kernel->run() is ready to return.  
 > >>> === 5267 === !!! Stopped child process (PID 5374) reaped when 
 > >>> POE::Kernel->run() is ready to return.  
 > >>> === 5267 === !!! At least one child process is still running when 
 > >>> POE::Kernel->run() is ready to return.  
 > >>> === 5267 === !!! Be sure to use sig_child() to reap child processes.  
 > >>> === 5267 === !!! In extreme cases, failure to reap child processes has  
 > >>> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems. 
 > >>>  
 > >>>  
 > >>> I see it every time the service is stopped because of an unhandled 
 > >>> condition, even when POE's event loop has already been running for hours. 
 > >>> It was not visible before, and I can't get rid of it. I've tried 
 > >>> different ways to avoid it, with no luck.  
 > >>>  
 > >>> Any advice or alternative approach on this?  
 > >>>  
 > >>> Many thanks  
 > >>> Alberto 
 >  
 >
