[
https://issues.apache.org/jira/browse/HBASE-14589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986708#comment-14986708
]
stack commented on HBASE-14589:
-------------------------------
I think I found it.
This evening 1.2 build on 1.7 JDK failed:
Build failed in Jenkins: HBase-1.2 » latest1.7,Hadoop #340
Says:
"[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on
project hbase-server: ExecutionException: java.lang.RuntimeException: The
forked VM terminated without properly saying goodbye. VM crash or System.exit
called?"
If I look for a hanging test... i.e. a test that started but has no ending...
kalashnikov:hbase.git stack$ python ./dev-support/findHangingTests.py
https://builds.apache.org/job/HBase-1.2/jdk=latest1.7,label=Hadoop/340/consoleText
Fetching
https://builds.apache.org/job/HBase-1.2/jdk=latest1.7,label=Hadoop/340/consoleText
Building remotely on H4 (Mapreduce zookeeper Hadoop Pig falcon Hdfs) in
workspace
/home/jenkins/jenkins-slave/workspace/HBase-1.2/jdk/latest1.7/label/Hadoop
Printing hanging tests
Hanging test :
org.apache.hadoop.hbase.regionserver.TestMultiVersionConcurrencyControl
Printing Failing tests
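A rough sketch of the "started but has no ending" check, for anyone who wants to run it against a saved consoleText. This is not the actual dev-support/findHangingTests.py. It assumes surefire prints "Running <class>" when a test class starts and "Tests run: ... - in <class>" when it finishes; the function name and the sample text are made up for illustration.

```python
import re

# Assumed surefire console format (not taken from findHangingTests.py):
#   "Running <class>" when a test class starts
#   "Tests run: ..., Time elapsed: ... - in <class>" when it finishes
RUNNING = re.compile(r"^Running (\S+)\s*$")
FINISHED = re.compile(r"^Tests run:.*?- in (\S+)\s*$")


def find_hanging_tests(console_text):
    """Return test classes that have a 'Running' line but no matching finish line."""
    started = []
    finished = set()
    for line in console_text.splitlines():
        match = RUNNING.match(line)
        if match:
            if match.group(1) not in started:
                started.append(match.group(1))
            continue
        match = FINISHED.match(line)
        if match:
            finished.add(match.group(1))
    return [name for name in started if name not in finished]


# Hypothetical console excerpt, not copied from build #340.
SAMPLE_CONSOLE = """\
Running org.example.TestA
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.5 sec - in org.example.TestA
Running org.example.TestB
Killed
"""

if __name__ == "__main__":
    for name in find_hanging_tests(SAMPLE_CONSOLE):
        print("Hanging test : " + name)
```

A class that was killed mid-run never gets its "Tests run:" line, so it shows up in the returned list.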
Looking at TestMultiVersionConcurrencyControl, it has 'Killed' written into the
console output just after it starts. It's a SmallTest that usually runs in a
matter of seconds, so it does not seem to be a resource problem. When I look at
its surefire files, there is no .xml file and the -output file has:
2015-11-03 00:37:29,072 INFO [main] hbase.ResourceChecker(147): before:
regionserver.TestMultiVersionConcurrencyControl#testParallelism Thread=4,
OpenFileDescriptor=181, MaxFileDescriptor=60000, SystemLoadAverage=252,
ProcessCount=313, AvailableMemoryMB=18363, ConnectionCount=0
... and nothing else.
There is no closing 'after:' (This test normally has no output anyways).
So, it does not seem to be a resource problem. Someone is killing us.
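The same "before: with no after:" check can be done mechanically over a surefire -output file. A minimal sketch, assuming the closing ResourceChecker line has the same shape as the opening one with "after:" in place of "before:" (only the "before:" line appears above); the function name is made up.

```python
import re

# Matches e.g. "hbase.ResourceChecker(147): before: regionserver.TestX#testY Thread=4, ..."
# The "after:" form is assumed to mirror the "before:" form shown in the log above.
CHECKPOINT = re.compile(r"hbase\.ResourceChecker\(\d+\): (before|after): (\S+)")


def unfinished_checks(log_text):
    """Return test methods that logged 'before:' but never logged 'after:'."""
    open_tests = []
    for match in CHECKPOINT.finditer(log_text):
        phase, test = match.group(1), match.group(2)
        if phase == "before":
            open_tests.append(test)
        elif test in open_tests:
            open_tests.remove(test)
    return open_tests


# The 'before:' line from the -output file above, joined back onto one line.
SAMPLE_LOG = (
    "2015-11-03 00:37:29,072 INFO [main] hbase.ResourceChecker(147): before: "
    "regionserver.TestMultiVersionConcurrencyControl#testParallelism Thread=4, "
    "OpenFileDescriptor=181, MaxFileDescriptor=60000, SystemLoadAverage=252, "
    "ProcessCount=313, AvailableMemoryMB=18363, ConnectionCount=0\n"
)

if __name__ == "__main__":
    print(unfinished_checks(SAMPLE_LOG))
```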
Looking up on our jenkins configurations for build jobs we still have this
zombie checking going on:
{code}
if [[ $ZOMBIE_TESTS_COUNT != 0 ]] ; then
  #It seems sometimes the tests are not dying immediately. Let's give them 10s
  echo "Suspicious java process found - waiting 10s to see if there are just slow to stop"
  sleep 10
  ZOMBIE_TESTS_COUNT=`jps | grep surefirebooter | wc -l`
  if [[ $ZOMBIE_TESTS_COUNT != 0 ]] ; then
    echo "There are $ZOMBIE_TESTS_COUNT zombie tests, they should have been killed by surefire but survived"
    echo "************ BEGIN zombies jstack extract"
    ZB_STACK=`jps | grep surefirebooter | cut -d ' ' -f 1 | xargs -n 1 jstack | grep ".test" | grep "\.java"`
    jps | grep surefirebooter | cut -d ' ' -f 1 | xargs -n 1 jstack
    echo "************ END zombies jstack extract"
    JIRA_COMMENT="$JIRA_COMMENT
{color:red}-1 core zombie tests{color}. There are ${ZOMBIE_TESTS_COUNT} zombie test(s): ${ZB_STACK}"
    BAD=1
    jps | grep surefirebooter | cut -d ' ' -f 1 | xargs kill -9
  else
    echo "We're ok: there is no zombie test, but some tests took some time to stop"
  fi
else
  echo "We're ok: there is no zombie test"
fi
{code}
It's from test-patch.sh.
We'll run it at the end of each run. It is more selective now. We will only
kill stuff of ours if it sticks around 30 seconds.
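The zombie check quoted above selects every process whose jps name contains "surefirebooter", whoever started it. The sketch below shows that selection next to a narrower one scoped to a single workspace. It is only an illustration, not what test-patch.sh does. It assumes `jps -l` prints the full path of the booter jar and that the jar lives under the job's workspace; neither has been checked on the build hosts. The function names and sample paths are made up.

```python
def parse_jps(output):
    """Parse jps-style lines of the form '<pid> <name>' into (pid, name) pairs."""
    entries = []
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            entries.append((int(parts[0]), parts[1].strip()))
    return entries


def surefire_pids(entries, workspace=None):
    """Select surefire booter pids; with workspace=None this mirrors 'grep surefirebooter'.

    Passing a workspace keeps only jars under that directory. That scoping is a
    hypothetical refinement and relies on `jps -l` reporting the jar's full path.
    """
    pids = []
    for pid, name in entries:
        if "surefirebooter" not in name:
            continue
        if workspace is not None and not name.startswith(workspace.rstrip("/") + "/"):
            continue
        pids.append(pid)
    return pids


# Hypothetical `jps -l` output with two jobs sharing one machine.
SAMPLE_JPS = """\
101 /ws/HBase-1.2/hbase-server/target/surefire/surefirebooter111.jar
202 /ws/HBase-Trunk-IT/hbase-it/target/surefire/surefirebooter222.jar
303 sun.tools.jps.Jps
"""

if __name__ == "__main__":
    entries = parse_jps(SAMPLE_JPS)
    print(surefire_pids(entries))                        # both jobs' booters
    print(surefire_pids(entries, "/ws/HBase-Trunk-IT"))  # only this job's booter
```

With no workspace given, a job running this check would pick up the other job's surefire process too, which is the failure mode being chased here.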
So, poking on jenkins I see that there is a history on what ran on what machine
when: https://builds.apache.org/computer/H4/builds
The 1.2 job that had a test killed ran on H4. The MURDER happened, as we can
see from the test output, just after 2015-11-03 00:37:29,072.
Looking at the test history, it looks like a few hbase-it tests were in the
neighbourhood at the time. Two of the three left the scene early but.....
https://builds.apache.org/job/HBase-Trunk-IT/it.test=IntegrationTestImportTsv,jdk=latest1.7,label=Hadoop/328/consoleText
got stuck at the end and hung around a while. At the end of its console output
I see....
{code}
....
+ echo '************ END zombies jstack extract'
************ END zombies jstack extract
+ JIRA_COMMENT='
{color:red}-1 core zombie tests{color}. There are 1 zombie test(s):
at
org.apache.hadoop.hbase.regionserver.TestMultiVersionConcurrencyControl.testParallelism(TestMultiVersionConcurrencyControl.java:115)'
+ BAD=1
+ jps
+ grep surefirebooter
+ cut -d ' ' -f 1
+ xargs kill -9
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Archiving artifacts
Finished: SUCCESS
{code}
If I back up a bit... I can find a timestamp...
{code}
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 29:40.686s
[INFO] Finished at: Tue Nov 03 00:37:25 UTC 2015
[INFO] Final Memory: 205M/4354M
....
{code}
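To put a number on how close the two events are, here is a small sketch that subtracts the Maven "Finished at" time of the IT job from the killed test's 'before:' log line. It assumes both timestamps come from the same clock and timezone (UTC); the test log line does not state its timezone, so that part is a guess. The variable names are made up, and the weekday and month names are parsed assuming an English locale.

```python
from datetime import datetime

# "Finished at" from the IT job's Maven output (UTC is stated in the line itself).
maven_finished = datetime.strptime(
    "Tue Nov 03 00:37:25 UTC 2015", "%a %b %d %H:%M:%S UTC %Y"
)

# Timestamp of the killed test's 'before:' line; assumed to be UTC as well.
test_started = datetime.strptime(
    "2015-11-03 00:37:29,072", "%Y-%m-%d %H:%M:%S,%f"
)

# Gap between the IT job's Maven run finishing and the 1.2 test logging its start.
gap = test_started - maven_finished

if __name__ == "__main__":
    print(gap.total_seconds())
```

Under that assumption the test logged its start about four seconds after the IT job's Maven run finished, so it was alive when that job's post-build zombie check went looking for surefirebooter processes.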
Cue the dragnet theme.
K. Going to purge all these killers from test-patch and from jenkins.... be
back.
> Looking for the surefire-killer; builds being killed...
> -------------------------------------------------------
>
> Key: HBASE-14589
> URL: https://issues.apache.org/jira/browse/HBASE-14589
> Project: HBase
> Issue Type: Sub-task
> Components: test
> Reporter: stack
> Assignee: stack
> Attachments: 14589.mx.patch, 14589.timeout.txt, 14589.txt,
> 14598.addendum.sufire.timeout.patch
>
>
> I see this in a build that started two hours ago... about 6:45... it's
> build 15941 on ubuntu-6
> {code}
> WARNING: 2 rogue build processes detected, terminating.
> /bin/kill -9 18640
> /bin/kill -9 22625
> {code}
> If I back up to build 15939, started about 3 1/2 hours ago, say, 5:15.... I
> see:
> Running org.apache.hadoop.hbase.client.TestShell
> Killed
> ... but it was running on ubuntu-1.... so it doesn't look like we are killing
> ourselves... when we do this in test-patch.sh
> ### Kill any rogue build processes from the last attempt
> $PS auxwww | $GREP ${PROJECT_NAME}PatchProcess | $AWK '{print $2}' |
> /usr/bin/xargs -t -I {} /bin/kill -9 {} > /dev/null
> The above code runs in a few places... in test-patch.sh.
> Let me try and add some more info around what is being killed...