[ 
https://issues.apache.org/jira/browse/HBASE-19204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262915#comment-16262915
 ] 

Xiao Chen commented on HBASE-19204:
-----------------------------------

We ran into this in dist-test as well, since the slave was Ubuntu with OpenJDK 
7u151. We also changed the JDK version to unblock testing, but have not fully 
root-caused it yet. The only difference is that we symlinked to 7u161 (from 
https://www.azul.com/downloads/zulu/zulu-linux/), which appeared to make the 
issue go away. 

Details below:
This reproduces with some Hadoop unit tests, simply by running 
{{mvn test -Dtest=}}. It happens when the MiniDFSCluster is shut down.

After that, we see 2 busy cores, caused by the surefirebooter subprocess, in 
which 2 threads are very likely livelocked, spinning, or the like.
{noformat}
$ top
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.7 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 97.0 us,  3.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   2046752 total,  1569804 used,   476948 free,   111284 buffers
KiB Swap:  1048572 total,   136416 used,   912156 free.   593724 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  560 root      20   0 6319436 573104  21048 S 200.0 28.0   3:57.93 java

$ ps -ef|grep java
root       499 23241 47 16:54 pts/1    00:01:00 /usr/lib/jvm/java-1.7.0-openjdk-amd64//bin/java -classpath /usr/share/maven/boot/plexus-classworlds-2.x.jar -Dclassworlds.conf=/usr/share/maven/bin/m2.conf -Dmaven.home=/usr/share/maven org.codehaus.plexus.classworlds.launcher.Launcher test -Dtest=TestBalancerRPCDelay
root       559   499  0 16:54 pts/1    00:00:00 /bin/sh -c cd /hadoop/hadoop-hdfs-project/hadoop-hdfs && /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -Xmx2048m -XX:MaxPermSize=768m -XX:+HeapDumpOnOutOfMemoryError -jar /hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefirebooter8947784507952007015.jar /hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire5967375233747058316tmp /hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire_03768054875548617068tmp
root       560   559 99 16:54 pts/1    00:02:46 /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -Xmx2048m -XX:MaxPermSize=768m -XX:+HeapDumpOnOutOfMemoryError -jar /hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefirebooter8947784507952007015.jar /hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire5967375233747058316tmp /hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire_03768054875548617068tmp

$ ps -Tp 560|grep -v 00:00:0
  PID  SPID TTY          TIME CMD
  560   568 pts/1    00:02:44 java
  560  3088 pts/1    00:02:43 java
{noformat}
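
For cases where a thread dump can be obtained, the SPID values above can be matched against it: HotSpot thread dumps print each thread's native ID as a hexadecimal {{nid=0x...}} field. A minimal sketch of that conversion follows; the class name {{NidHex}} and method {{toNid}} are hypothetical and only illustrate the arithmetic.

```java
/** Hypothetical helper: converts a Linux thread ID (SPID/LWP) to the hex form used in HotSpot thread dumps. */
public final class NidHex {
    private NidHex() {}

    /** Returns the thread ID formatted like the nid field of a thread dump, e.g. 568 becomes 0x238. */
    public static String toNid(long lwpId) {
        return "0x" + Long.toHexString(lwpId);
    }

    public static void main(String[] args) {
        // The two busy SPIDs reported by `ps -Tp 560` above.
        long[] busySpids = {568L, 3088L};
        for (long spid : busySpids) {
            System.out.println(spid + " -> nid=" + toNid(spid));
        }
        // prints:
        // 568 -> nid=0x238
        // 3088 -> nid=0xc10
    }
}
```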

The tricky part is that, since this appears to be a JVM bug, the usual Java 
tooling does not work. This includes jstack, jcmd, and attempts in the unit 
test to have a background thread print all stack traces. I also tried strace 
but was not able to attach; I am not sure whether that is because this runs in 
Docker.
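
For reference, the background stack-printing thread mentioned above could look roughly like the sketch below. This is not the code that was actually used; the class {{StackDumper}} and its method names are hypothetical, and it relies only on the standard {{Thread.getAllStackTraces()}} API. It is written without lambdas so that it compiles on JDK 7.

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical debugging helper; not part of Hadoop, HBase, or surefire. */
public final class StackDumper {
    private StackDumper() {}

    /** Formats the name, id, state, and stack frames of every live thread in this JVM. */
    public static String dumpAllStacks() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" id=").append(t.getId())
              .append(" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Starts a daemon thread that prints all stacks to stderr every periodSeconds seconds. */
    public static Thread startPeriodicDump(final long periodSeconds) {
        Thread dumper = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    while (true) {
                        TimeUnit.SECONDS.sleep(periodSeconds);
                        System.err.println(dumpAllStacks());
                    }
                } catch (InterruptedException ie) {
                    // Interrupted: stop dumping and let the thread exit.
                }
            }
        }, "stack-dumper");
        dumper.setDaemon(true); // must not keep the test JVM alive
        dumper.start();
        return dumper;
    }
}
```

A test would call {{StackDumper.startPeriodicDump(30)}} in a setup method. Collecting stack traces needs cooperation from the JVM, so a dumper like this can stall if the spinning threads never yield to it; that would be consistent with what was observed here, but it is a guess rather than a diagnosis.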

Anyway, I'm attaching a tarball of the surefire files, with which I was able 
to reproduce the hang in the Docker container by running the following:
{code}
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -Xmx2048m -XX:MaxPermSize=768m \
  -XX:+HeapDumpOnOutOfMemoryError -jar surefirebooter715602899278246659.jar \
  surefire2008821777752463579tmp surefire_0176004967692917743tmp
{code}
Hope this helps.

Thanks [~jojochuang] for pointing me to this jira.

> branch-1.2 times out and is taking 6-7 hours to complete
> --------------------------------------------------------
>
>                 Key: HBASE-19204
>                 URL: https://issues.apache.org/jira/browse/HBASE-19204
>             Project: HBase
>          Issue Type: Umbrella
>          Components: test
>            Reporter: stack
>
> Sean has been looking at tooling and infra. This umbrella is about looking 
> at actual tests. For example, running locally on a dedicated machine, I picked a 
> random test, TestPerColumnFamilyFlush. In my test run, it wrote 16M lines. It 
> seems to be having zk issues but it is catching interrupts and ignoring them 
> ([~carp84] fixed this in later versions over in HBASE-18441).
> Let me try and do some fixup under this umbrella so we can get a 1.2.7 out 
> the door.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
