[ 
https://issues.apache.org/jira/browse/HBASE-14589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986708#comment-14986708
 ] 

stack commented on HBASE-14589:
-------------------------------

I think I found it.

This evening the 1.2 build on JDK 1.7 failed:

Build failed in Jenkins: HBase-1.2 » latest1.7,Hadoop #340

Says:

"[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on 
project hbase-server: ExecutionException: java.lang.RuntimeException: The 
forked VM terminated without properly saying goodbye. VM crash or System.exit 
called?"

If I look for a hanging test, i.e. a test that started but never finished:

kalashnikov:hbase.git stack$ python ./dev-support/findHangingTests.py  
https://builds.apache.org/job/HBase-1.2/jdk=latest1.7,label=Hadoop/340/consoleText
Fetching 
https://builds.apache.org/job/HBase-1.2/jdk=latest1.7,label=Hadoop/340/consoleText
Building remotely on H4 (Mapreduce zookeeper Hadoop Pig falcon Hdfs) in 
workspace 
/home/jenkins/jenkins-slave/workspace/HBase-1.2/jdk/latest1.7/label/Hadoop
Printing hanging tests
Hanging test : 
org.apache.hadoop.hbase.regionserver.TestMultiVersionConcurrencyControl
Printing Failing tests
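
The "started but never finished" check can be sketched in a few lines of shell. This is only an illustration of the idea, not the logic of findHangingTests.py itself: `find_hanging_tests` is a hypothetical helper, and it assumes surefire prints `Running <class>` when a test class starts and `Tests run: ... - in <class>` when it completes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of dev-support): print every test class that
# has a "Running <class>" line but no matching "Tests run: ... - in <class>".
# Assumes surefire 2.x console format; $1 is a saved consoleText file.
find_hanging_tests() {
  awk '
    /^Running /           { started[$2] = 1 }    # class name is the 2nd field
    /^Tests run:.* - in / { finished[$NF] = 1 }  # class name is the last field
    END {
      for (t in started)
        if (!(t in finished)) print t
    }
  ' "$1" | sort
}
```

Usage would be something like `find_hanging_tests consoleText`, after saving the Jenkins console output to a local file.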

Looking at TestMultiVersionConcurrencyControl, it has 'Killed' written into the 
console output just after it starts. It's a SmallTest that usually runs in a 
matter of seconds, so it does not seem to be a resource problem. When I look at 
its surefire files, there is no .xml file and the -output has:

2015-11-03 00:37:29,072 INFO  [main] hbase.ResourceChecker(147): before: 
regionserver.TestMultiVersionConcurrencyControl#testParallelism Thread=4, 
OpenFileDescriptor=181, MaxFileDescriptor=60000, SystemLoadAverage=252, 
ProcessCount=313, AvailableMemoryMB=18363, ConnectionCount=0

... and nothing else.

There is no closing 'after:'  (This test normally has no output anyways).

So, it does not seem to be a resource problem. Someone is killing us.

Looking through our Jenkins configurations for build jobs, we still have this 
zombie checking going on:

{code}
  if [[ $ZOMBIE_TESTS_COUNT != 0 ]] ; then
    #It seems sometimes the tests are not dying immediately. Let's give them 10s
    echo "Suspicious java process found - waiting 10s to see if there are just slow to stop"
    sleep 10
    ZOMBIE_TESTS_COUNT=`jps | grep surefirebooter | wc -l`
    if [[ $ZOMBIE_TESTS_COUNT != 0 ]] ; then
      echo "There are $ZOMBIE_TESTS_COUNT zombie tests, they should have been killed by surefire but survived"
      echo "************ BEGIN zombies jstack extract"
      ZB_STACK=`jps | grep surefirebooter | cut -d ' ' -f 1 | xargs -n 1 jstack | grep ".test" | grep "\.java"`
      jps | grep surefirebooter | cut -d ' ' -f 1 | xargs -n 1 jstack
      echo "************ END  zombies jstack extract"
      JIRA_COMMENT="$JIRA_COMMENT

     {color:red}-1 core zombie tests{color}.  There are ${ZOMBIE_TESTS_COUNT} zombie test(s): ${ZB_STACK}"
      BAD=1
      jps | grep surefirebooter | cut -d ' ' -f 1 | xargs kill -9
    else
      echo "We're ok: there is no zombie test, but some tests took some time to stop"
    fi
  else
    echo "We're ok: there is no zombie test"
  fi
{code}

It's from test-patch.sh.

We'll run it at the end of each run. It is more selective now: we will only 
kill stuff of ours if it sticks around for 30 seconds.

So, poking around on Jenkins, I see that there is a history of what ran on 
which machine and when: https://builds.apache.org/computer/H4/builds

The 1.2 job whose test was killed ran on H4. The MURDER happened, as we can 
see from the test output, just after 2015-11-03 00:37:29,072.

Looking at the test history, it looks like a few hbase-it tests were in the 
neighbourhood at the time.  Two of the three left the scene early but.....

https://builds.apache.org/job/HBase-Trunk-IT/it.test=IntegrationTestImportTsv,jdk=latest1.7,label=Hadoop/328/consoleText

got stuck at the end and hung around a while. At the end of its console output 
I see:



{code}
....
+ echo '************ END  zombies jstack extract'
************ END  zombies jstack extract
+ JIRA_COMMENT='

     {color:red}-1 core zombie tests{color}.  There are 1 zombie test(s):       at org.apache.hadoop.hbase.regionserver.TestMultiVersionConcurrencyControl.testParallelism(TestMultiVersionConcurrencyControl.java:115)'
+ BAD=1
+ jps
+ grep surefirebooter
+ cut -d ' ' -f 1
+ xargs kill -9
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Archiving artifacts
Finished: SUCCESS
{code}
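
The trace shows the weakness of picking victims with `jps | grep surefirebooter` alone: it matches every surefire fork the Jenkins user has on the machine, whichever job started it. A rough sketch of scoping the selection to one job's workspace follows. It is not what test-patch.sh does; `filter_pids_in_workspace` and `list_surefire_cwds` are hypothetical helpers, and the sketch assumes Linux `/proc` and that each fork's working directory sits under the job's Jenkins `$WORKSPACE`.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical, not part of test-patch.sh.

# Reads "PID CWD" pairs on stdin and prints only the PIDs whose working
# directory is the given workspace or a directory below it.
filter_pids_in_workspace() {
  local workspace="${1%/}" pid cwd   # strip a trailing slash, if any
  while read -r pid cwd; do
    case "$cwd" in
      "$workspace"|"$workspace"/*) echo "$pid" ;;
    esac
  done
}

# Linux-only: pair each surefire fork's PID with its working directory,
# resolved through /proc. Assumes jps is on the PATH.
list_surefire_cwds() {
  local pid
  for pid in $(jps | grep surefirebooter | cut -d ' ' -f 1); do
    echo "$pid $(readlink "/proc/$pid/cwd")"
  done
}
```

A kill restricted to the current job would then read `list_surefire_cwds | filter_pids_in_workspace "$WORKSPACE" | xargs kill -9`, leaving forks that belong to other jobs on the same machine alone.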

If I back up a bit, I can find a timestamp:


{code}
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 29:40.686s
[INFO] Finished at: Tue Nov 03 00:37:25 UTC 2015
[INFO] Final Memory: 205M/4354M
....

{code}
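
To line the two timestamps up, here is the arithmetic as a small sketch. `to_seconds` is a hypothetical helper, and the comparison assumes both logs use the same clock and time zone (both jobs ran on H4; the Maven line says UTC, the test log does not say).

```shell
#!/usr/bin/env bash
# Convert HH:MM:SS to seconds since midnight.
# The 10# prefix forces base 10 so values like "08" are not read as octal.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

maven_finished="00:37:25"   # "Finished at" in the IntegrationTestImportTsv console
last_test_log="00:37:29"    # ResourceChecker "before:" line, milliseconds dropped

echo "gap: $(( $(to_seconds "$last_test_log") - $(to_seconds "$maven_finished") ))s"
```

So the test wrote its last log line about four seconds after the hbase-it Maven run finished, which fits it being alive when that job's post-build zombie check ran.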

Cue the dragnet theme.

K. Going to purge all these killers from test-patch and from Jenkins.... be 
back.





> Looking for the surefire-killer; builds being killed...
> -------------------------------------------------------
>
>                 Key: HBASE-14589
>                 URL: https://issues.apache.org/jira/browse/HBASE-14589
>             Project: HBase
>          Issue Type: Sub-task
>          Components: test
>            Reporter: stack
>            Assignee: stack
>         Attachments: 14589.mx.patch, 14589.timeout.txt, 14589.txt, 
> 14598.addendum.sufire.timeout.patch
>
>
> I see this in a build that started at two hours ago... about 6:45... its 
> build 15941 on ubuntu-6
> {code}
> WARNING: 2 rogue build processes detected, terminating.
> /bin/kill -9 18640 
> /bin/kill -9 22625 
> {code}
> If I back up to build 15939, started about 3 1/2 hours ago, say, 5:15....  I 
> see:
> Running org.apache.hadoop.hbase.client.TestShell
> Killed
> ... but it was running on ubuntu-1.... so it doesn't look like we are killing 
> ourselves...  when we do this in test-patch.sh
>   ### Kill any rogue build processes from the last attempt
>   $PS auxwww | $GREP ${PROJECT_NAME}PatchProcess | $AWK '{print $2}' | 
> /usr/bin/xargs -t -I {} /bin/kill -9 {} > /dev/null
> The above code runs in a few places... in test-patch.sh.
> Let me try and add some more info around what is being killed... 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
