[
https://issues.apache.org/jira/browse/HBASE-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046571#comment-15046571
]
Yu Li commented on HBASE-4830:
------------------------------
[~stack] two of our online machines hit an OOME and were aborted by this
{{-XX:OnOutOfMemoryError="kill -9 %p"}} config, and the error message was
truncated in the .out file with nothing (no stacktrace or anything else useful for
debugging) left in the .log file. Below is the content of our .out file:
{noformat}
/home/hadoop/hadoop_hbase/hbase-online/bin/hbase-daemon.sh: line 216: 217101
Killed nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
--config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
eeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 217101"...
{noformat}
From this I can only guess that the cause of the OOME is "Requested array size
exceeds VM limit", and I have no way to debug it further.
So my question is, from a maintenance perspective: is it better to kill -9
when an OOME happens, or to leave the hung RS there for troubleshooting? Is the
"kill -9" config meant to make HMaster detect the RS failure more quickly, thus
reducing recovery time? Thanks.
btw, checking the JVM source, the JVM forks a process to execute the kill -9
command before printing the error stacktrace; see lines 160~182 of
[collectedHeap.inline.hpp|http://hg.openjdk.java.net/jdk7u/jdk7u/hotspot/file/d61a34c5c764/src/share/vm/gc_interface/collectedHeap.inline.hpp]
and the last lines of
[vmError.cpp|http://hg.openjdk.java.net/jdk7u/jdk7u/hotspot/file/d61a34c5c764/src/share/vm/utilities/vmError.cpp].
Below is an excerpt:
{code}
if (!gc_overhead_limit_was_exceeded) {
  // -XX:+HeapDumpOnOutOfMemoryError and -XX:OnOutOfMemoryError support
  report_java_out_of_memory("Java heap space");
  if (JvmtiExport::should_post_resource_exhausted()) {
    JvmtiExport::post_resource_exhausted(
      JVMTI_RESOURCE_EXHAUSTED_OOM_ERROR | JVMTI_RESOURCE_EXHAUSTED_JAVA_HEAP,
      "Java heap space");
  }
  THROW_OOP_0(Universe::out_of_memory_error_java_heap());
} else {
  // -XX:+HeapDumpOnOutOfMemoryError and -XX:OnOutOfMemoryError support
  report_java_out_of_memory("GC overhead limit exceeded");
  if (JvmtiExport::should_post_resource_exhausted()) {
    JvmtiExport::post_resource_exhausted(
      JVMTI_RESOURCE_EXHAUSTED_OOM_ERROR | JVMTI_RESOURCE_EXHAUSTED_JAVA_HEAP,
      "GC overhead limit exceeded");
  }
  THROW_OOP_0(Universe::out_of_memory_error_gc_overhead_limit());
}
}

void VMError::report_java_out_of_memory() {
  if (OnOutOfMemoryError && OnOutOfMemoryError[0]) {
    MutexLocker ml(Heap_lock);
    VM_ReportJavaOutOfMemory op(this);
    VMThread::execute(&op);
  }
}
{code}
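Fwiw, if the cause really is an over-sized array request, the behaviour should be easy to
reproduce outside HBase with a toy allocation like the sketch below (my own hypothetical
test class, not HBase code); running it with and without the
{{-XX:OnOutOfMemoryError="kill -9 %p"}} flag shows whether the stacktrace survives:
{code}
// OomKillRepro.java -- hypothetical toy class, not part of HBase.
// Run e.g.:
//   java OomKillRepro                                       -> full OOME stacktrace on stderr
//   java -XX:OnOutOfMemoryError="kill -9 %p" OomKillRepro   -> handler's kill -9 fires first,
//                                                              matching the ordering above
public class OomKillRepro {
  public static void main(String[] args) {
    // Asking for Integer.MAX_VALUE elements exceeds HotSpot's maximum array length,
    // so this should throw OutOfMemoryError: "Requested array size exceeds VM limit"
    // regardless of -Xmx, and trigger the OnOutOfMemoryError hook.
    long[] huge = new long[Integer.MAX_VALUE];
    System.out.println(huge.length); // never reached
  }
}
{code}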
> Regionserver BLOCKED on WAITING DFSClient$DFSOutputStream.waitForAckedSeqno running 0.20.205.0+
> -----------------------------------------------------------------------------------------------
>
> Key: HBASE-4830
> URL: https://issues.apache.org/jira/browse/HBASE-4830
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: stack
> Fix For: 0.92.0
>
> Attachments: 4830-v2.txt, 4830.txt,
> hbase-stack-regionserver-sv4r9s38.out
>
>
> Running 0.20.205.1 (I was not at tip of the branch) I ran into the following
> hung regionserver:
> {code}
> "regionserver7003.logRoller" daemon prio=10 tid=0x00007fd98028f800 nid=0x61af
> in Object.wait() [0x00007fd987bfa000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:485)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.waitForAckedSeqno(DFSClient.java:3606)
> - locked <0x00000000f8656788> (a java.util.LinkedList)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal(DFSClient.java:3595)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3687)
> - locked <0x00000000f8656458> (a
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3626)
> at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
> at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
> at
> org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:966)
> - locked <0x00000000f8655998> (a
> org.apache.hadoop.io.SequenceFile$Writer)
> at
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.close(SequenceFileLogWriter.java:214)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLog.cleanupCurrentWriter(HLog.java:791)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:578)
> - locked <0x00000000c443deb0> (a java.lang.Object)
> at
> org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> Other threads are like this (here's a sample):
> {code}
> "regionserver7003.logSyncer" daemon prio=10 tid=0x00007fd98025e000 nid=0x61ae
> waiting for monitor entry [0x00007fd987cfb000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1074)
> - waiting to lock <0x00000000c443deb0> (a java.lang.Object)
> at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1195)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1057)
> at java.lang.Thread.run(Thread.java:662)
> ....
> "IPC Server handler 0 on 7003" daemon prio=10 tid=0x00007fd98049b800
> nid=0x61b8 waiting for monitor entry [0x00007fd9872f1000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:1007)
> - waiting to lock <0x00000000c443deb0> (a java.lang.Object)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1798)
> at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1668)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:2980)
> at sun.reflect.GeneratedMethodAccessor636.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
> at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1325)
> {code}
> Looks like HDFS-1529? (Todd?)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)