Hi,

Sorry, but I don't understand exactly what you mean by "not using sig_child() as intended". Do you mean calling it from the main session so that each child process is reaped properly?

The issue I have is how to handle unexpected exceptions. They seem to be thrown without any control, killing POE's kernel before I can intervene. I could be thinking about the timing in the wrong way, though...

Alberto

---- On Mon, 24 Mar 2014 16:59:49 +0100 Rocco Caputo wrote ----

> You are not using sig_child() as intended. When used as intended,
> sig_child() will prevent shutdown until the child process has exited and has
> been reaped. The timing issues you're worried about should not exist.
>
> --
> Rocco Caputo
>
> On Mar 24, 2014, at 11:44, albertocurro wrote:
>
> > Hi Rocco,
> >
> > many thanks for your quick answer! Unfortunately, the provided solution
> > only works partially. I still have some cases where the "fork bomb"
> > message is here with us :(
> >
> > One of the cases is this one: under some configuration, an instance of
> > nginx is started, so our product writes the configuration file and starts
> > the Nginx instance pointing to that configuration file. BUT, if the
> > configuration file could not be written (directory does not exist, etc),
> > then the error is raised, and I've not found any way to handle it:
> >
> > DEBUG - Created nginx temporary directory /opt/tmp/pull/instance1
> > DEBUG - Created nginx configuration directory /opt/etc/pull/instance1
> > DEBUG - Created nginx log directory /opt/log/pull/instance1
> > DEBUG - creating nginx configfile for instance 1 in
> > /opt/etc/pull/instance1
> > === 13991 === !!! Kernel has 1 child process(es).
> > === 13991 === !!! At least one child process is still running when
> > POE::Kernel->run() is ready to return.
> > === 13991 === !!! Be sure to use sig_child() to reap child processes.
> > === 13991 === !!! In extreme cases, failure to reap child processes has
> > === 13991 === !!! resulted in a slow 'fork bomb' that has halted systems.
> > Could not open file: No such file or directory
> >
> > I've added a DIE handler in the main session to try to handle this:
> >
> > $sig_session = POE::Session->create(
> >     inline_states => {
> >         _start => sub {
> >             $_[HEAP]{RELOADED} = 0;
> >             $_[KERNEL]->sig(TERM => '_sigterm');
> >             $_[KERNEL]->sig(INT => '_sigterm');
> >             $_[KERNEL]->sig(DIE => '_sigterm');
> >             $_[KERNEL]->sig(nginx_reload => '_sig_nginx_reload');
> >             $_[KERNEL]->alias_set('sighandler');
> >         },
> >         _sigdie => sub {
> >             print "Handling exception, calling stop";
> >             POE::Kernel->call($sig_session, '_stop');
> >         },
> >         _stop => sub {
> >             # Reap any existing pid (# 1825119)
> >             print "Handling stop";
> >             POE::Kernel->sig_child();
> >             use POSIX ":sys_wait_h";
> >             # waitpid() takes the PID first (-1 = any child), then the flags
> >             1 while waitpid(-1, WNOHANG) > 0;
> >
> >             # Clear signal handlers...
> >             $_[KERNEL]->sig('TERM');
> >
> > But, as said above, it's not working. Checking POE's code, I can see the
> > message lines are generated in Resources/Signals.pm, in the
> > _data_sig_finalize() method (where POE is already doing the same thing
> > you recommended to me, waiting for the pid).
> >
> > But the _data_sig_finalize() method is called in Kernel.pm just after
> > unregistering all the signals (Kernel.pm => _finalize_kernel):
> >
> > my $self = shift;
> >
> > # Disable signal watching since there's now no place for them to go.
> > foreach ($self->_data_sig_get_safe_signals()) {
> >     $self->loop_ignore_signal($_);
> > }
> >
> > # Remove the kernel session's signal watcher.
> > $self->_data_sig_remove($self->ID, "IDLE");
> >
> > # The main loop is done, no matter which event library ran it.
> > # sig before loop so that it clears the signal_pipe file handler
> > $self->_data_sig_finalize();
> > $self->loop_finalize();
> >
> > Once here, none of my signal handlers in the main session instance would
> > work, as the signals have been unregistered. On an exception (die) while
> > POE::Kernel->run() is running, how could I handle it then?
> >
> > Thanks a lot
> > Alberto
> >
> > ---- On Mon, 24 Mar 2014 13:45:45 +0100 Rocco Caputo wrote ----
> >
> >> Hi, Alberto.
> >>
> >> At program end time, POE runs a quick waitpid() check for child processes
> >> that may have leaked. This check was added after a bug report where POE
> >> locked up a server after several days of running. It turned out to be the
> >> reporter's application, but it was hard to debug.
> >>
> >> Your program seems to have created two processes that it didn't reap:
> >> PIDs 5373 and 5374. The ideal solution is to reap those processes before
> >> exiting. Your program can do this using POE::Kernel's sig_child() method.
> >>
> >> In some cases, a third-party library will create processes and not
> >> properly clean them up. It can be impossible to solve this case without
> >> modifying other people's code.
> >>
> >> If you just want to ignore the problem, this might do the trick. Put
> >> these lines in your last _stop handler. They should reap the processes
> >> you've leaked before POE's check:
> >>
> >>     use POSIX ":sys_wait_h";
> >>     1 while waitpid(-1, WNOHANG) > 0;
> >>
> >> It's a bit of a pain, but I think it's better to explicitly ignore the
> >> problem than for it to go unnoticed by default.
> >>
> >> Please let me know whether that resolves your problem. It may not. For
> >> example, the processes may still be open until an object is destroyed at
> >> global destruction time.
> >>
> >> --
> >> Rocco Caputo
> >>
> >> On Mar 24, 2014, at 05:46, albertocurro wrote:
> >>
> >>> Guys,
> >>>
> >>> We have a product developed using POE as a base framework, with some
> >>> other tool libraries such as log4perl; basically it is a forward proxy,
> >>> composed of several modules, each one of them comprising a POE::Session;
> >>> all of them share an internal queue of tasks to be performed.
> >>> Each module performs several tasks on initialization, and if anything
> >>> goes wrong, croak() is called to stop the service -> this is considered
> >>> ok, since croak() is only called during initialization, when validation
> >>> is being performed.
> >>>
> >>> The product is stable and works really fine, but recently I updated POE
> >>> to the latest version, and since then we can see this message in the
> >>> logs:
> >>>
> >>> registering pdu failed: 263!
> >>> === 5267 === 5 -> on_handle (from Handler/StoreRemote.pm at 87)
> >>> === 5267 === 5 -> on_retry (from Handler/StoreRemote.pm at 141)
> >>> === 5267 === 9 -> on_handle (from Handler/StoreRemote.pm at 87)
> >>> === 5267 === 9 -> on_retry (from Handler/StoreRemote.pm at 141)
> >>> === 5267 === !!! Kernel has child processes.
> >>> === 5267 === !!! Stopped child process (PID 5373) reaped when
> >>> POE::Kernel->run() is ready to return.
> >>> === 5267 === !!! Stopped child process (PID 5374) reaped when
> >>> POE::Kernel->run() is ready to return.
> >>> === 5267 === !!! At least one child process is still running when
> >>> POE::Kernel->run() is ready to return.
> >>> === 5267 === !!! Be sure to use sig_child() to reap child processes.
> >>> === 5267 === !!! In extreme cases, failure to reap child processes has
> >>> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems.
> >>>
> >>> mkdir /mnt/nfs99: Permission denied at Handler/Store.pm line 147
> >>>
> >>> The first lines and the last line above are the errors themselves, but
> >>> this part is new since the upgrade:
> >>>
> >>> === 5267 === !!! Kernel has child processes.
> >>> === 5267 === !!! Stopped child process (PID 5373) reaped when
> >>> POE::Kernel->run() is ready to return.
> >>> === 5267 === !!! Stopped child process (PID 5374) reaped when
> >>> POE::Kernel->run() is ready to return.
> >>> === 5267 === !!! At least one child process is still running when
> >>> POE::Kernel->run() is ready to return.
> >>> === 5267 === !!! Be sure to use sig_child() to reap child processes.
> >>> === 5267 === !!! In extreme cases, failure to reap child processes has
> >>> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems.
> >>>
> >>> I can see it every time the service is stopped because of an unhandled
> >>> condition, even when POE's event loop has already been running for
> >>> hours. It was not visible before, and I can't get rid of it in any way.
> >>> I've tried different ways to avoid it with no luck.
> >>>
> >>> Any advice or alternative approach on this?
> >>>
> >>> Many thanks
> >>> Alberto
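
A note on the reaping one-liner discussed in this thread: Perl's built-in waitpid() takes the PID first and the flags second, so a non-blocking "reap whatever has already exited" loop is normally written with -1 (any child) followed by WNOHANG. What follows is only a minimal sketch in core Perl, without POE. The helper reap_exited_children() and the two short-lived forked children are hypothetical, invented for illustration, and this is not a replacement for POE's sig_child().

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";    # exports WNOHANG

# Hypothetical helper (not part of POE): reap every child that has
# already exited, without blocking. Returns a list of pid => exit code.
sub reap_exited_children {
    my %reaped;
    # waitpid(-1, WNOHANG): -1 means "any child", WNOHANG means "do not block".
    # It returns a pid (> 0) for each exited child, 0 while children exist
    # but none has exited yet, and -1 when there are no children at all.
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $reaped{$pid} = $? >> 8;    # high byte of $? is the exit code
    }
    return %reaped;
}

# Demo workload (an assumption for illustration): two children that
# exit immediately with known exit codes.
my %expected;
for my $code (3, 7) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    exit $code if $pid == 0;        # child process
    $expected{$pid} = $code;        # parent remembers what to expect
}

# Poll until both children have been collected.
my %seen;
while (scalar(keys %seen) < scalar(keys %expected)) {
    %seen = (%seen, reap_exited_children());
    select(undef, undef, undef, 0.05);    # short pause between polls
}

printf "reaped pid %d, exit code %d\n", $_, $seen{$_}
    for sort { $a <=> $b } keys %seen;
```

Because waitpid() returns 0 when children exist but have not exited yet, the loop inside the helper never blocks. That is why the demo polls. Inside a POE program the polling would presumably be unnecessary, since sig_child() is the documented way to be notified when a given PID exits.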
