Hello,
I'm trying to use Jenkins to poll SCM and run a big series of unit and
integration tests that were parallelized using a Perl program written using
Log::Log4Perl, Parallel::ForkManager, and IPC::Run.
I have version 1.424.6 on Ubuntu Server 12.04.
Strange lockups on writes to STDOUT and STDERR, and zombie processes happen
under Jenkins, similar to the leaked process wiki page and some past threads
discussing issues involving not draining stdout and stderr correctly, but
somewhat different. The same commands work fine from the jenkins user over
ssh, and on other users and other machines.
STDOUT and STDERR have been set to flush the buffer after each line is
written, in an attempt to fix whatever stdout / stderr drain issues may exist
(see code example below), but it did not seem to affect the behavior.
The workflow looks like:
1) Write some logs to STDOUT and to file.
2) Use ForkManager to loop over a series of unit test file paths.
3) Use IPC::Run to run each one on a 600 second timeout, check the exit
status, and print out the interesting parts of the test STDOUT and STDERR onto
the real STDOUT if something breaks.
Issues which happen:
1) I'm getting zombie child processes which don't exit, even though ForkManager
normally does this fine and hasn't failed before via other means:
2) For whatever reason, the child process writes to STDOUT and STDERR don't
appear in Jenkins, but they do continue to appear in the Log4Perl disk for a
while after the test begins.
3) Eventually, things begin to stall, the test runner stops forking children,
and the build locks up and will never end, even after 17+ hours waiting
overnight. Running things one at a time with ForkManager's debug mode seems to
work but it results in a build process too slow to be useful for me.
4) When the stall happens, things look like this:
* nothing in jenkins.log:
$ ls -l /var/log/jenkins/jenkins.log
-rw-r--r-- 1 jenkins jenkins 0 Sep 12 06:30 /var/log/jenkins/jenkins.log
* processes in unusual states, zombies, etc:
$ ps auxww | fgrep jenkins
jenkins 498 0.0 0.0 0 0 ? Z 17:42 0:00
[unit_tests.pl] <defunct>
jenkins 584 0.0 0.0 0 0 ? Z 17:42 0:00
[unit_tests.pl] <defunct>
jenkins 648 0.0 0.0 0 0 ? Z 17:42 0:00
[unit_tests.pl] <defunct>
jenkins 876 0.0 0.0 0 0 ? Z 17:43 0:00
[unit_tests.pl] <defunct>
jenkins 1140 0.0 0.0 0 0 ? Z 17:43 0:00
[unit_tests.pl] <defunct>
jenkins 1204 0.0 0.0 0 0 ? Z 17:43 0:00
[unit_tests.pl] <defunct>
jenkins 1513 0.0 0.0 0 0 ? Z 17:43 0:00
[unit_tests.pl] <defunct>
jenkins 22257 0.0 0.0 4392 608 ? S 17:41 0:00 /bin/sh -xe
/tmp/hudson997856685355182685.sh
jenkins 22262 0.0 0.0 58160 12192 ? S 17:41 0:00 /usr/bin/perl
smoke_tests/unit_tests.pl
jenkins 22263 0.1 0.0 58768 11820 ? S 17:41 0:00 /usr/bin/perl
smoke_tests/unit_tests.pl
jenkins 26713 0.0 9.5 12239304 2368828 ? Ssl Sep11 7:32 /usr/bin/java
-jar /usr/share/jenkins/jenkins.war --webroot=/var/run/jenkins/war
--httpPort=8080 --ajp13Port=-1 --preferredClassLoader=java.net.URLClassLoader
--argumentsRealm.passwd.mhall=SecureJenkins2012
--argumentsRealm.roles.mhall=admin --logfile=/var/log/jenkins/jenkins.log
jenkins 31174 0.0 0.0 58780 11428 ? S 17:42 0:00 /usr/bin/perl
smoke_tests/unit_tests.pl
jenkins 31175 0.7 0.1 146320 48380 ? Sl 17:42 0:01
/usr/local/ruby192/bin/ruby -r ./environment.rb
/var/lib/jenkins/workspace/studio_security/root2/test/mu/scenario/field/tc_wget.rb
jenkins 31220 0.0 0.0 58768 11588 ? S 17:42 0:00 /usr/bin/perl
smoke_tests/unit_tests.pl
jenkins 31229 0.0 0.0 0 0 ? Z 17:42 0:00
[unit_tests.pl] <defunct>
jenkins 31988 0.0 0.0 58768 11592 ? S 17:42 0:00 /usr/bin/perl
smoke_tests/unit_tests.pl
jenkins 32367 0.0 0.0 58768 11588 ? S 17:42 0:00 /usr/bin/perl
smoke_tests/unit_tests.pl
* problematic blocking on writes to STDOUT and STDERR:
$ sudo strace -s 4096 -p 22262
Process 22262 attached - interrupt to quit
read(6, ^C <unfinished ...>
$ sudo strace -s 4096 -p 22263
Process 22263 attached - interrupt to quit
write(1, "[2012-09-19 17:43:14,691] [unit_tests.pl:40] [INFO] running tests
from [test/mu/scenario/musl]...\n", 98^C <unfinished ...>
$ sudo strace -s 4096 -p 31220
Process 31220 attached - interrupt to quit
write(1, "[2012-09-19 17:42:37,484] [Common.pm:290] [INFO] stderr [02]: -
RELOAD_Frame = udp(src: &host_2, dst: &host_3, dst_port: 16428)\n", 135^C
<unfinished ...>
$ sudo strace -s 4096 -p 31988
Process 31988 attached - interrupt to quit
write(1, "[2012-09-19 17:42:46,049] [Common.pm:272] [WARN] command [ruby -r
./environment.rb
/var/lib/jenkins/workspace/studio_security/root2/test/mu/scenario/from_pcap/tc_icmp.rb]
failed\n", 178^C <unfinished ...>
* the only message present in the Jenkins log, happens when I cancel the
frozen build:
Sep 19, 2012 6:00:07 PM hudson.model.Run run
INFO: studio_security #21 aborted
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
at java.lang.UNIXProcess.waitFor(UNIXProcess.java:210)
at hudson.Proc$LocalProc.join(Proc.java:319)
at hudson.Launcher$ProcStarter.join(Launcher.java:336)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:82)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:692)
at hudson.model.Build$RunnerImpl.build(Build.java:177)
at hudson.model.Build$RunnerImpl.doRun(Build.java:139)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:458)
at hudson.model.Run.run(Run.java:1376)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:175)
* this one randomly appeared when doing a restart:
java.lang.NullPointerException
at
hudson.security.ContainerAuthentication.<init>(ContainerAuthentication.java:64)
at
hudson.security.BasicAuthenticationFilter.doFilter(BasicAuthenticationFilter.java:108)
at
hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
at
hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:76)
at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:164)
at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
at
hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:81)
at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
at winstone.RequestDispatcher.forward(RequestDispatcher.java:333)
at
winstone.RequestHandlerThread.processRequest(RequestHandlerThread.java:244)
at winstone.RequestHandlerThread.run(RequestHandlerThread.java:150)
at java.lang.Thread.run(Thread.java:722)
* I tried enabling FINEST level on hudson and jenkins loggers, but didn't see
anything helpful in the output (probably wasn't looking in the right spot).
I tried using Jenkins to run a skeleton example application like this one
below and it runs fine.
Also very important: the problem disappears completely when I disable the
Screen appender writing to STDOUT in my Log4Perl configuration file.
I could really use a helping hand from some pros to debug this some more or
get more verbosity from Jenkins, etc. I spent a long time looking at it and
I'm really confused about what could be going on here! :)
Thanks,
Matthew.
Example:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager 0.7.6;
STDERR->autoflush(1);
STDOUT->autoflush(1);
my $successes = 0;
sub run_on_finish {
my ($pid, $exit, $id, $signal, $core, $d) = @_;
++$successes unless $exit;
}
my $fm = Parallel::ForkManager->new(12);
$fm->run_on_finish(\&run_on_finish);
for (my $i = 1; $i <= 1000; ++$i) {
print STDOUT sprintf("Hello, STDOUT! [P] [%04d]\n", $i);
print STDERR sprintf("Hello, STDERR! [P] [%04d]\n", $i);
$fm->start($i) and next;
my $d = {};
my $status = 0;
print STDOUT sprintf("Hello, STDOUT! [C] [%04d]\n", $i);
print STDERR sprintf("Hello, STDERR! [C] [%04d]\n", $i);
$fm->finish($status, $d);
}
$fm->wait_all_children();
print "FM Successes [$successes]\n";
exit(0);