[
https://issues.apache.org/jira/browse/HBASE-19204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262915#comment-16262915
]
Xiao Chen commented on HBASE-19204:
-----------------------------------
We also ran into this in dist-test, since the slave was Ubuntu with OpenJDK
7u151. We also changed the JDK version to unblock testing, but have not fully
root-caused the problem yet. The only difference is that we symlinked to 7u161
(from https://www.azul.com/downloads/zulu/zulu-linux/), which appeared to make
the issue go away.
Details below:
This reproduces with some Hadoop unit tests, simply by running
{{mvn test -Dtest=}}. It happens when the MiniDFSCluster is shut down.
After that, we see 2 busy cores, due to the surefirebooter subprocess, inside
which it's very likely that 2 threads are spinning (livelocked, spin-locked, or
the like).
{noformat}
$ top
Tasks: 5 total, 1 running, 4 sleeping, 0 stopped, 0 zombie
%Cpu0 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.7 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 97.0 us, 3.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 2046752 total, 1569804 used, 476948 free, 111284 buffers
KiB Swap: 1048572 total, 136416 used, 912156 free. 593724 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
560 root 20 0 6319436 573104 21048 S 200.0 28.0 3:57.93 java
$ ps -ef|grep java
root 499 23241 47 16:54 pts/1 00:01:00
/usr/lib/jvm/java-1.7.0-openjdk-amd64//bin/java -classpath
/usr/share/maven/boot/plexus-classworlds-2.x.jar
-Dclassworlds.conf=/usr/share/maven/bin/m2.conf -Dmaven.home=/usr/share/maven
org.codehaus.plexus.classworlds.launcher.Launcher test
-Dtest=TestBalancerRPCDelay
root 559 499 0 16:54 pts/1 00:00:00 /bin/sh -c cd
/hadoop/hadoop-hdfs-project/hadoop-hdfs &&
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -Xmx2048m -XX:MaxPermSize=768m
-XX:+HeapDumpOnOutOfMemoryError -jar
/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefirebooter8947784507952007015.jar
/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire5967375233747058316tmp
/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire_03768054875548617068tmp
root 560 559 99 16:54 pts/1 00:02:46
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -Xmx2048m -XX:MaxPermSize=768m
-XX:+HeapDumpOnOutOfMemoryError -jar
/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefirebooter8947784507952007015.jar
/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire5967375233747058316tmp
/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire_03768054875548617068tmp
$ ps -Tp 560|grep -v 00:00:0
PID SPID TTY TIME CMD
560 568 pts/1 00:02:44 java
560 3088 pts/1 00:02:43 java
{noformat}
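The {{grep -v 00:00:0}} filter above hides every thread with less than ten seconds of CPU time. A minimal sketch of the same idea with an explicit threshold follows. The function name {{busy_threads}} is hypothetical, and it assumes the {{PID SPID TTY TIME CMD}} column layout shown above with TIME as HH:MM:SS (no day prefix). It also prints each thread id in hex, which is how HotSpot thread dumps label native threads ({{nid=0x...}}), in case a dump can be obtained some other way.

```shell
# Hypothetical helper: list threads from `ps -T` output whose CPU time
# is at least the given number of seconds (default 10).
# Assumes columns: PID SPID TTY TIME CMD, with TIME as HH:MM:SS.
busy_threads() {
  awk -v min="${1:-10}" '
    $2 !~ /^[0-9]+$/ { next }             # skip the header and odd lines
    {
      split($4, t, ":")                   # TIME -> hours, minutes, seconds
      secs = t[1] * 3600 + t[2] * 60 + t[3]
      if (secs >= min)
        printf "%s 0x%x %d\n", $2, $2, secs   # SPID, SPID in hex, CPU seconds
    }'
}

# Usage (PID taken from the output above):
#   ps -Tp 560 | busy_threads 10
```

For the two threads listed above this would print {{568 0x238 164}} and {{3088 0xc10 163}}.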
The tricky part is that, since this appears to be a JVM bug, the usual Java
tooling does not work. This includes jstack, jcmd, and attempts in the unit test
to have a background thread print all stack traces. I also tried strace but was
not able to attach; I am not sure whether this is due to running in Docker.
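One thing that may still be readable when nothing can attach is the per-thread accounting under {{/proc/<pid>/task/<tid>/stat}}, since reading it does not require ptrace. This is only a sketch and was not tried in this environment. The function name {{thread_cpu_ticks}} is hypothetical; the field positions (state, utime, stime) follow the proc(5) layout.

```shell
# Hypothetical helper: read one /proc/<pid>/task/<tid>/stat line on stdin
# and print "state utime stime" (times are in clock ticks).
thread_cpu_ticks() {
  # Drop everything up to the last ") " so a thread name containing
  # spaces or parentheses cannot shift the field positions.
  sed 's/^.*) //' | awk '{ print $1, $12, $13 }'
}

# Usage sketch (PID 560 is the spinning JVM from the output above):
#   for f in /proc/560/task/*/stat; do
#     printf '%s ' "$f"; thread_cpu_ticks < "$f"
#   done
```

Sampling this twice a few seconds apart should show which threads are in state R and still accumulating utime, which would at least distinguish a user-space spin from time spent in the kernel.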
Anyway, I'm attaching a tarball of the surefire files, with which I was able to
reproduce the issue in the Docker container by running the following:
{code}
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -Xmx2048m -XX:MaxPermSize=768m \
  -XX:+HeapDumpOnOutOfMemoryError -jar surefirebooter715602899278246659.jar \
  surefire2008821777752463579tmp surefire_0176004967692917743tmp
{code}
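When re-running the reproduction against different JDK builds, it may help to bound each attempt so a spinning JVM does not block the run indefinitely. A minimal sketch, assuming GNU coreutils {{timeout}} is available in the container; the function name {{run_with_deadline}} and the 5-second kill grace period are illustrative choices, not part of the original setup.

```shell
# Hypothetical helper: run a command with a deadline and report whether
# it had to be stopped. Assumes GNU coreutils `timeout`.
run_with_deadline() {
  limit=$1; shift                  # $1: seconds to wait; rest: the command
  # Send TERM at the deadline, then KILL 5 seconds later in case the
  # process does not react to TERM.
  timeout -k 5 "$limit" "$@"
  status=$?
  # timeout exits 124 when TERM was enough, 137 (128+9) when KILL was needed.
  if [ "$status" -eq 124 ] || [ "$status" -eq 137 ]; then
    echo "hung"
  else
    echo "exited:$status"
  fi
}

# Usage sketch, with the command's own output discarded:
#   run_with_deadline 600 /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java ... 2>/dev/null
```

The verdict is written to stdout after whatever the command itself prints, so redirect the command's output if only the verdict is wanted.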
Hope this helps.
Thanks [~jojochuang] for pointing me to this jira.
> branch-1.2 times out and is taking 6-7 hours to complete
> --------------------------------------------------------
>
> Key: HBASE-19204
> URL: https://issues.apache.org/jira/browse/HBASE-19204
> Project: HBase
> Issue Type: Umbrella
> Components: test
> Reporter: stack
>
> Sean has been looking at tooling and infra. This umbrella is about looking
> at actual tests. For example, running locally on a dedicated machine, I picked a
> random test, TestPerColumnFamilyFlush. In my test run, it wrote 16M lines. It
> seems to be having zk issues but it is catching interrupts and ignoring them
> ([~carp84] fixed this in later versions over in HBASE-18441).
> Let me try and do some fixup under this umbrella so we can get a 1.2.7 out
> the door.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)