You are not using sig_child() as intended. When used as intended, sig_child() will prevent shutdown until the child process has exited and has been reaped. The timing issues you're worried about should not exist.
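For illustration, here is a minimal sketch of that usage. It assumes the child is started with a plain fork() and exec(). The child_exited event name and the @reaped array are hypothetical names for this example only, and the child command is a stand-in for whatever your program really launches.

```perl
use strict;
use warnings;
use POSIX ();
use POE;

my @reaped;    # hypothetical bookkeeping: [pid, wait status] per reaped child

POE::Session->create(
  inline_states => {
    _start => sub {
      my $pid = fork();
      die "fork failed: $!" unless defined $pid;

      if ($pid == 0) {
        # Child: run a stand-in command that exits with code 3.
        # _exit() skips the parent's END blocks if exec() fails.
        exec($^X, '-e', 'exit 3') or POSIX::_exit(127);
      }

      # Parent: register a watcher for this specific PID. The session
      # stays alive until the child has exited and has been reaped.
      $_[KERNEL]->sig_child($pid, 'child_exited');
    },

    child_exited => sub {
      # ARG1 is the child's PID, ARG2 is its wait status ($?).
      my ($pid, $status) = @_[ARG1, ARG2];
      push @reaped, [$pid, $status];
      print "child $pid exited with code ", $status >> 8, "\n";
    },
  },
);

POE::Kernel->run();
```

The watcher is tied to one PID, so as far as I can tell from the POE::Kernel documentation it should be registered once per child, from the same handler that forks, before that handler returns.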
--
Rocco Caputo

On Mar 24, 2014, at 11:44, albertocurro wrote:

> Hi Rocco,
>
> many thanks for your quick answer! Unfortunately, the provided solution only
> works partially. I still have some cases where the "fork bomb" message is
> here with us :(
>
> One of the cases is this one: under some configuration, an instance of nginx
> is started, so our product writes the configuration file and starts the Nginx
> instance pointing to that configuration file. BUT, if the configuration file
> could not be written (directory does not exist, etc), then the error raises,
> and I've not found any way to handle it:
>
> DEBUG - Created nginx temporary directory /opt/tmp/pull/instance1
> DEBUG - Created nginx configuration directory /opt/etc/pull/instance1
> DEBUG - Created nginx log directory /opt/log/pull/instance1
> DEBUG - creating nginx configfile for instance 1 in /opt/etc/pull/instance1
> === 13991 === !!! Kernel has 1 child process(es).
> === 13991 === !!! At least one child process is still running when
> POE::Kernel->run() is ready to return.
> === 13991 === !!! Be sure to use sig_child() to reap child processes.
> === 13991 === !!! In extreme cases, failure to reap child processes has
> === 13991 === !!! resulted in a slow 'fork bomb' that has halted systems.
> Could not open file: No such file or directory
>
> I've added a DIE handler in the main session to try to handle this:
>
> $sig_session = POE::Session->create(
>   inline_states => {
>     _start => sub {
>       $_[HEAP]{RELOADED} = 0;
>       $_[KERNEL]->sig(TERM => '_sigterm');
>       $_[KERNEL]->sig(INT => '_sigterm');
>       $_[KERNEL]->sig(DIE => '_sigterm');
>       $_[KERNEL]->sig(nginx_reload => '_sig_nginx_reload');
>       $_[KERNEL]->alias_set('sighandler');
>     },
>     _sigdie => sub {
>       print "Handling exception, calling stop";
>       POE::Kernel->call($sig_session, '_stop');
>     },
>     _stop => sub {
>       # Reap any existing pid (# 1825119)
>       print "Handling stop";
>       POE::Kernel->sig_child();
>       use POSIX ":sys_wait_h";
>       1 while waitpid(-1, WNOHANG) > 0;
>
>       # Clear signal handlers...
>       $_[KERNEL]->sig('TERM');
>
> But, as said above, it's not working. Checking POE's code, I can see the
> message lines are generated in Resources/Signals.pm, under
> _data_sig_finalize() method (where POE is already doing the same you
> recommended me, waiting for the pid).
>
> But _data_sig_finalize() method is called in Kernel.pm just after
> unregistering all the signals (Kernel.pm => _finalize_kernel):
>
> my $self = shift;
>
> # Disable signal watching since there's now no place for them to go.
> foreach ($self->_data_sig_get_safe_signals()) {
>   $self->loop_ignore_signal($_);
> }
>
> # Remove the kernel session's signal watcher.
> $self->_data_sig_remove($self->ID, "IDLE");
>
> # The main loop is done, no matter which event library ran it.
> # sig before loop so that it clears the signal_pipe file handler
> $self->_data_sig_finalize();
> $self->loop_finalize();
>
> Once here, none of my signal handlers in the main session instance would
> work, as the signals have been unregistered. On an exception (die) while
> POE::Kernel->run(), how could I handle it then??
>
> Thanks a lot
> Alberto
>
> ---- On Mon, 24 Mar 2014 13:45:45 +0100 Rocco Caputo wrote ----
>
>> Hi, Alberto.
>>
>> At program end time, POE runs a quick waitpid() check for child processes
>> that may have leaked. This check was added after a bug report where POE
>> locked up a server after several days of running. It turned out to be the
>> reporter's application, but it was hard to debug.
>>
>> Your program seems to have created two processes that it didn't reap: PIDs
>> 5373 and 5374. The ideal solution is to reap those processes before exiting.
>> Your program can do this using POE::Kernel's sig_child() method.
>>
>> In some cases, a third-party library will create processes and not properly
>> clean them up. It can be impossible to solve this case without modifying
>> other people's code.
>>
>> If you just want to ignore the problem, this might do the trick. Put these
>> lines in your last _stop handler. They should reap the processes you've
>> leaked before POE's check:
>>
>> use POSIX ":sys_wait_h";
>> 1 while waitpid(-1, WNOHANG) > 0;
>>
>> It's a bit of a pain, but I think it's better to explicitly ignore the
>> problem than for it to go unnoticed by default.
>>
>> Please let me know whether that resolves your problem. It may not. For
>> example, the processes may still be open until an object is destroyed at
>> global destruction time.
>>
>> --
>> Rocco Caputo
>>
>> On Mar 24, 2014, at 05:46, albertocurro wrote:
>>
>>> Guys,
>>>
>>> We have a product developed using POE as a base framework, with some other
>>> tool libraries such as log4perl; basically it is a forward proxy, composed of
>>> several modules, each one of them comprising a POE::Session; all of them
>>> share an internal queue of tasks to be performed. Each module performs
>>> several tasks on initialization, and if anything goes wrong, croak() is
>>> called to stop the service -> this is considered ok, since croak() is only
>>> called during initialization, when validation is being performed.
>>>
>>> The product is stable and works really fine, but recently I updated POE to
>>> the latest version, and since then we can see this message in the logs:
>>>
>>> registering pdu failed: 263!
>>> === 5267 === 5 -> on_handle (from Handler/StoreRemote.pm at 87)
>>> === 5267 === 5 -> on_retry (from Handler/StoreRemote.pm at 141)
>>> === 5267 === 9 -> on_handle (from Handler/StoreRemote.pm at 87)
>>> === 5267 === 9 -> on_retry (from Handler/StoreRemote.pm at 141)
>>> === 5267 === !!! Kernel has child processes.
>>> === 5267 === !!! Stopped child process (PID 5373) reaped when
>>> POE::Kernel->run() is ready to return.
>>> === 5267 === !!! Stopped child process (PID 5374) reaped when
>>> POE::Kernel->run() is ready to return.
>>> === 5267 === !!! At least one child process is still running when
>>> POE::Kernel->run() is ready to return.
>>> === 5267 === !!! Be sure to use sig_child() to reap child processes.
>>> === 5267 === !!! In extreme cases, failure to reap child processes has
>>> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems.
>>> mkdir /mnt/nfs99: Permission denied at Handler/Store.pm line 147
>>>
>>> first lines and last line above are the errors itself, but this part is new
>>> since the upgrading:
>>>
>>> === 5267 === !!! Kernel has child processes.
>>> === 5267 === !!! Stopped child process (PID 5373) reaped when
>>> POE::Kernel->run() is ready to return.
>>> === 5267 === !!! Stopped child process (PID 5374) reaped when
>>> POE::Kernel->run() is ready to return.
>>> === 5267 === !!! At least one child process is still running when
>>> POE::Kernel->run() is ready to return.
>>> === 5267 === !!! Be sure to use sig_child() to reap child processes.
>>> === 5267 === !!! In extreme cases, failure to reap child processes has
>>> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems.
>>>
>>> I can see it every time the service is stopped because of an unhandled
>>> condition, even when POE's event loop has been already running for hours. It
>>> was not visible before, and I can't get rid of it in any way. I've tried
>>> different ways to avoid it with no luck.
>>>
>>> Any advice or alternative approach on this?
>>>
>>> Many thanks
>>> Alberto
