Hi again,

sorry! There is a mistake in the code below: the DIE signal is shown linked to _sigterm, while in the real code it points to _sigdie. Just clarifying it before someone says "it can't work, you are pointing to the wrong method!" :D

Alberto
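For reference, a minimal self-contained sketch of a session whose DIE watcher points at _sigdie. It is not the product's code: the do_work and shutdown states, the @log array and the simulated error are invented for illustration. It assumes POE's default exception catching, where the handler receives the exception details as a hash reference in ARG1 and calls sig_handled() to absorb the exception.

```perl
use strict;
use warnings;
use POE;

my @log;    # hypothetical: records what happened, for illustration only

POE::Session->create(
    inline_states => {
        _start => sub {
            # Register the DIE watcher under the name of the state
            # that is actually defined below.
            $_[KERNEL]->sig(DIE => '_sigdie');
            $_[KERNEL]->yield('do_work');
        },
        do_work => sub {
            # Simulated initialization failure (hypothetical).
            die "Could not open file\n";
        },
        _sigdie => sub {
            my $exception = $_[ARG1];    # hash reference describing the exception
            chomp(my $error = $exception->{error_str});
            push @log, "caught: $error";
            $_[KERNEL]->sig_handled();        # absorb the exception
            $_[KERNEL]->yield('shutdown');    # clean up from a normal event
        },
        shutdown => sub {
            # Clear the watcher so nothing keeps the session alive.
            $_[KERNEL]->sig('DIE');
        },
        _stop => sub {
            push @log, 'stopped';
        },
    },
);

POE::Kernel->run();
print "$_\n" for @log;
```

In this sketch the exception is raised inside the same session that watches DIE. Whether a separate signal-handling session sees exceptions raised by other sessions is not shown here.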
---- On Mon, 24 Mar 2014 16:44:36 +0100 albertocurro <albertocu...@zoho.com> wrote ----
> Hi Rocco,
>
> many thanks for your quick answer! Unfortunately, the provided solution
> only works partially. I still have some cases where the "fork bomb" message
> is still with us :(
>
> One of the cases is this one: under some configurations, an instance of
> nginx is started, so our product writes the configuration file and starts
> the nginx instance pointing to that configuration file. BUT, if the
> configuration file cannot be written (directory does not exist, etc.),
> then the error is raised, and I have not found any way to handle it:
>
> DEBUG - Created nginx temporary directory /opt/tmp/pull/instance1
> DEBUG - Created nginx configuration directory /opt/etc/pull/instance1
> DEBUG - Created nginx log directory /opt/log/pull/instance1
> DEBUG - creating nginx configfile for instance 1 in /opt/etc/pull/instance1
> === 13991 === !!! Kernel has 1 child process(es).
> === 13991 === !!! At least one child process is still running when POE::Kernel->run() is ready to return.
> === 13991 === !!! Be sure to use sig_child() to reap child processes.
> === 13991 === !!! In extreme cases, failure to reap child processes has
> === 13991 === !!! resulted in a slow 'fork bomb' that has halted systems.
> Could not open file: No such file or directory
>
> I've added a DIE handler in the main session to try to handle this:
>
> $sig_session = POE::Session->create(
>     inline_states => {
>         _start => sub {
>             $_[HEAP]{RELOADED} = 0;
>             $_[KERNEL]->sig(TERM => '_sigterm');
>             $_[KERNEL]->sig(INT => '_sigterm');
>             $_[KERNEL]->sig(DIE => '_sigterm');
>             $_[KERNEL]->sig(nginx_reload => '_sig_nginx_reload');
>             $_[KERNEL]->alias_set('sighandler');
>         },
>         _sigdie => sub {
>             print "Handling exception, calling stop";
>             POE::Kernel->call($sig_session, '_stop');
>         },
>         _stop => sub {
>             # Reap any existing pid (# 1825119)
>             print "Handling stop";
>             POE::Kernel->sig_child();
>             use POSIX ":sys_wait_h";
>             # waitpid() takes the PID first (-1 = any child), then the flags.
>             1 while waitpid(-1, WNOHANG) > 0;
>
>             # Clear signal handlers...
>             $_[KERNEL]->sig('TERM');
>
> But, as said above, it's not working. Checking POE's code, I can see that
> the message lines are generated in Resource/Signals.pm, in the
> _data_sig_finalize() method (where POE already does the same thing you
> recommended: waiting for the pid).
>
> But the _data_sig_finalize() method is called in Kernel.pm just after all
> the signals have been unregistered (Kernel.pm => _finalize_kernel):
>
>     my $self = shift;
>
>     # Disable signal watching since there's now no place for them to go.
>     foreach ($self->_data_sig_get_safe_signals()) {
>         $self->loop_ignore_signal($_);
>     }
>
>     # Remove the kernel session's signal watcher.
>     $self->_data_sig_remove($self->ID, "IDLE");
>
>     # The main loop is done, no matter which event library ran it.
>     # sig before loop so that it clears the signal_pipe file handler
>     $self->_data_sig_finalize();
>     $self->loop_finalize();
>
> Once we get here, none of the signal handlers in my main session can
> work, as the signals have already been unregistered. When an exception
> (die) happens during POE::Kernel->run(), how can I handle it then?
>
> Thanks a lot
> Alberto
>
> ---- On Mon, 24 Mar 2014 13:45:45 +0100 Rocco Caputo wrote ----
>
> >Hi, Alberto.
> >
> >At program end time, POE runs a quick waitpid() check for child processes
> >that may have leaked. This check was added after a bug report where POE
> >locked up a server after several days of running. It turned out to be the
> >reporter's application, but it was hard to debug.
> >
> >Your program seems to have created two processes that it didn't reap: PIDs
> >5373 and 5374. The ideal solution is to reap those processes before
> >exiting. Your program can do this using POE::Kernel's sig_child() method.
> >
> >In some cases, a third-party library will create processes and not properly
> >clean them up. It can be impossible to solve this case without modifying
> >other people's code.
> >
> >If you just want to ignore the problem, this might do the trick. Put these
> >lines in your last _stop handler. They should reap the processes you've
> >leaked before POE's check:
> >
> >use POSIX ":sys_wait_h";
> >1 while waitpid(-1, WNOHANG) > 0;
> >
> >It's a bit of a pain, but I think it's better to explicitly ignore the
> >problem than for it to go unnoticed by default.
> >
> >Please let me know whether that resolves your problem. It may not. For
> >example, the processes may still be open until an object is destroyed at
> >global destruction time.
> >
> >--
> >Rocco Caputo
> >
> >On Mar 24, 2014, at 05:46, albertocurro wrote:
> >
> >> Guys,
> >>
> >> We have a product developed using POE as a base framework, together with
> >> some other libraries such as log4perl. Basically it is a forward proxy
> >> composed of several modules, each of them comprising a POE::Session; all
> >> of them share an internal queue of tasks to be performed. Each module
> >> performs several tasks on initialization, and if anything goes wrong,
> >> croak() is called to stop the service. This is considered OK, since
> >> croak() is only called during initialization, when validation is being
> >> performed.
> >>
> >> The product is stable and works really well, but I recently updated POE
> >> to the latest version, and since then we see this message in the
> >> logs:
> >>
> >> registering pdu failed: 263!
> >> === 5267 === 5 -> on_handle (from Handler/StoreRemote.pm at 87)
> >> === 5267 === 5 -> on_retry (from Handler/StoreRemote.pm at 141)
> >> === 5267 === 9 -> on_handle (from Handler/StoreRemote.pm at 87)
> >> === 5267 === 9 -> on_retry (from Handler/StoreRemote.pm at 141)
> >> === 5267 === !!! Kernel has child processes.
> >> === 5267 === !!! Stopped child process (PID 5373) reaped when POE::Kernel->run() is ready to return.
> >> === 5267 === !!! Stopped child process (PID 5374) reaped when POE::Kernel->run() is ready to return.
> >> === 5267 === !!! At least one child process is still running when POE::Kernel->run() is ready to return.
> >> === 5267 === !!! Be sure to use sig_child() to reap child processes.
> >> === 5267 === !!! In extreme cases, failure to reap child processes has
> >> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems.
> >> mkdir /mnt/nfs99: Permission denied at Handler/Store.pm line 147
> >>
> >> The first lines and the last line above are the errors themselves, but
> >> this part is new since the upgrade:
> >>
> >> === 5267 === !!! Kernel has child processes.
> >> === 5267 === !!! Stopped child process (PID 5373) reaped when POE::Kernel->run() is ready to return.
> >> === 5267 === !!! Stopped child process (PID 5374) reaped when POE::Kernel->run() is ready to return.
> >> === 5267 === !!! At least one child process is still running when POE::Kernel->run() is ready to return.
> >> === 5267 === !!! Be sure to use sig_child() to reap child processes.
> >> === 5267 === !!! In extreme cases, failure to reap child processes has
> >> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems.
> >>
> >> I can see it every time the service is stopped because of an unhandled
> >> condition, even when POE's event loop has already been running for hours.
> >> It was not visible before, and I can't get rid of it in any way. I've
> >> tried different ways to avoid it, with no luck.
> >>
> >> Any advice or alternative approach on this?
> >>
> >> Many thanks
> >> Alberto
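
The non-blocking reaping loop discussed in this thread can be tried on its own, outside POE. The following is a minimal sketch in core Perl only. The helper name reap_exited_children, the %pending bookkeeping and the five-second deadline are hypothetical choices made for the demonstration, not part of POE or of the product. Note that Perl's waitpid() takes the PID first (-1 meaning "any child") and the flags second.

```perl
use strict;
use warnings;
use POSIX ':sys_wait_h';    # provides WNOHANG

# Reap every child that has already exited, without blocking.
# Returns the list of PIDs that were reaped by this call.
sub reap_exited_children {
    my @reaped;
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        push @reaped, $pid;
    }
    return @reaped;
}

# Demonstration: fork two short-lived children.
my %pending;
for (1 .. 2) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    exit 0 if $pid == 0;    # child: exit immediately
    $pending{$pid} = 1;     # parent: remember the child
}

# Poll until both children are reaped or the deadline passes.
my @all_reaped;
my $deadline = time + 5;
while (%pending && time < $deadline) {
    for my $pid (reap_exited_children()) {
        delete $pending{$pid};
        push @all_reaped, $pid;
    }
    select(undef, undef, undef, 0.05) if %pending;    # brief pause between polls
}

print "reaped ", scalar(@all_reaped), " child process(es)\n";
```

Because WNOHANG makes waitpid() return immediately, a single pass only collects children that have already exited. That is why the demonstration polls instead of calling the helper once.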