Mingjie Lai created HBASE-13782:
-----------------------------------
Summary: RS stuck after FATAL "FSHLog: Could not append."
Key: HBASE-13782
URL: https://issues.apache.org/jira/browse/HBASE-13782
Project: HBase
Issue Type: Bug
Components: wal
Affects Versions: 1.0.1
Environment: hbase version: 1.0.0-cdh5.4.0
hadoop version: 2.6.0-cdh5.4.0
Reporter: Mingjie Lai
Priority: Critical
hbase version: 1.0.0-cdh5.4.0
hadoop version: 2.6.0-cdh5.4.0
Environment: a 40-node hadoop cluster shared by a 10-node hbase cluster and a
30-node yarn cluster.
We noticed that one RS stopped serving any client requests starting at
2015-05-26 01:05:33, while all other RSes were fine. I checked the RS log and
found several FATAL entries logged when
org.apache.hadoop.hbase.regionserver.wal.FSHLog tried to append() and sync():
{code}
2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
2015-05-26 01:05:33,700 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Archiving hdfs://nameservice1/hbase/WALs/hbase08.company.com,60020,1431985722474/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966 to hdfs://nameservice1/hbase/oldWALs/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966
2015-05-26 01:05:33,701 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of wal
{code}
Since the HDFS cluster is shared with the YARN cluster, there were some IO-heavy
jobs running at the time, and they exhausted the xceiver threads on some of the
DNs at exactly that moment. I think that is why the RS got
"java.io.IOException: Bad connect ack with firstBadLink".
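(For context, and not part of the original incident data: the DN xceiver limit
is the dfs.datanode.max.transfer.threads setting in hdfs-site.xml, historically
the misspelled key dfs.datanode.max.xcievers. A sketch of raising it on a
cluster shared with IO-heavy YARN jobs, with a purely illustrative value, could
look like the following; raising it only reduces the chance of hitting the
error, the real issue here is what the RS does after the failure.)
{code}
<!-- hdfs-site.xml on the DataNodes. The value below is only an illustrative
     choice, not a recommendation from this report; older configs used the
     deprecated key dfs.datanode.max.xcievers for the same limit. -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
{code}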
The problem is that the RS stayed stuck and unresponsive from then on.
flushQueueLength grew to the ceiling and stayed there, and the only log entries
after that are from periodicFlusher:
{code}
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,3992+80bb1,1432526964367.c4906e519c1f8206a284c66a8eda2159. after a delay of 11000
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0814+0416,1432541066864.cf42d5ab47e051d69e516971e82e84be. after a delay of 7874
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2022+7a571,1432528246524.299c1d4bb28fda2a4d9f248c6c22153c. after a delay of 22740
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2635+b9b677,1432540367215.749efc885317a2679e2ea39bb0845fbe. after a delay of 3162
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0401+985e,1432527151473.eb97576381fce10a9616efd471103920. after a delay of 9142
{code}
Looks like there is an RS-level deadlock triggered by the handling of that FATAL
append exception. In the end, I had to restart the RS service to rescue the
regions from the stuck RS. For reference, the relevant catch block in FSHLog's
append path:
{code}
    } catch (Exception e) {
      LOG.fatal("Could not append. Requesting close of wal", e);
      requestLogRoll();
      throw e;
    }
    numEntries.incrementAndGet();
  }
{code}
Maybe the RS should just abort itself after the FATAL exception, since it can no
longer append the WAL to HDFS?
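For illustration only (this is not actual HBase code, and the Abortable/WalSink
types below are made-up stand-ins), a minimal sketch of what "abort instead of
hang" could look like:
{code}
import java.io.IOException;

// Hypothetical sketch, not FSHLog: shows escalating a fatal WAL append
// failure to a server abort instead of only requesting a log roll.
public class WalAppendAbortSketch {

  /** Stand-in for the regionserver (similar in spirit to HBase's Abortable). */
  interface Abortable {
    void abort(String why, Throwable cause);
  }

  /** Stand-in for the HDFS output stream the WAL writes to. */
  interface WalSink {
    void append(byte[] entry) throws IOException;
  }

  private final WalSink sink;
  private final Abortable server;

  WalAppendAbortSketch(WalSink sink, Abortable server) {
    this.sink = sink;
    this.server = server;
  }

  /** Append one edit; on failure, request a roll and then abort the server. */
  void append(byte[] entry) {
    try {
      sink.append(entry);
    } catch (IOException e) {
      // Current behavior is LOG.fatal(...) plus a log-roll request. The
      // suggestion is to also abort, so the master reassigns the regions
      // instead of the RS sitting there with a full flush queue.
      requestLogRoll();
      server.abort("Could not append to WAL; giving up so regions can be reassigned", e);
    }
  }

  private void requestLogRoll() {
    // Placeholder for the real roll request.
  }

  public static void main(String[] args) {
    WalSink failingSink = entry -> {
      throw new IOException("Bad connect ack with firstBadLink as 10.28.1.17:50010");
    };
    Abortable fakeServer = (why, cause) ->
        System.out.println("abort requested: " + why + " (" + cause + ")");
    new WalAppendAbortSketch(failingSink, fakeServer).append(new byte[0]);
  }
}
{code}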
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)