Mingjie Lai created HBASE-13782:
-----------------------------------

             Summary: RS stuck after FATAL ``FSHLog: Could not append.''
                 Key: HBASE-13782
                 URL: https://issues.apache.org/jira/browse/HBASE-13782
             Project: HBase
          Issue Type: Bug
          Components: wal
    Affects Versions: 1.0.1
         Environment: hbase version: 1.0.0-cdh5.4.0
hadoop version: 2.6.0-cdh5.4.0 
            Reporter: Mingjie Lai
            Priority: Critical


hbase version: 1.0.0-cdh5.4.0
hadoop version: 2.6.0-cdh5.4.0 


Environment: a 40-node Hadoop cluster shared between a 10-node HBase cluster and a 
30-node YARN cluster.

We noticed that one RS stopped serving any client requests at 2015-05-26 
01:05:33, while all the other RSs were fine. I checked the RS log and found 
FATAL entries from org.apache.hadoop.hbase.regionserver.wal.FSHLog when it 
tried to append() and sync():

{code}
2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
2015-05-26 01:05:33,700 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Archiving hdfs://nameservice1/hbase/WALs/hbase08.company.com,60020,1431985722474/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966 to hdfs://nameservice1/hbase/oldWALs/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966
2015-05-26 01:05:33,701 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of wal
{code}

Since the HDFS cluster is shared with a YARN cluster, some IO-heavy jobs were 
running at the time and exhausted the xceiver limit on some of the DNs at the 
exact same moment. I believe that is why the RS got ``java.io.IOException: 
Bad connect ack with firstBadLink''.
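If xceiver exhaustion is indeed the trigger, one HDFS-side mitigation is to raise the DataNode's transfer-thread cap in hdfs-site.xml. Just a hedged suggestion for this shared-cluster setup, not a fix for the RS hang itself; the value below is illustrative (the Hadoop 2.x default is 4096, and this property supersedes the older dfs.datanode.max.xcievers):

{code}
<!-- hdfs-site.xml on the DataNodes; value is an example, tune for the workload -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
{code}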

The problem is that the RS has been stuck and unresponsive ever since. 
flushQueueLength grew to its ceiling and stayed there. The only log entries 
since then are from periodicFlusher:

{code}
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,3992+80bb1,1432526964367.c4906e519c1f8206a284c66a8eda2159. after a delay of 11000
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0814+0416,1432541066864.cf42d5ab47e051d69e516971e82e84be. after a delay of 7874
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2022+7a571,1432528246524.299c1d4bb28fda2a4d9f248c6c22153c. after a delay of 22740
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2635+b9b677,1432540367215.749efc885317a2679e2ea39bb0845fbe. after a delay of 3162
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0401+985e,1432527151473.eb97576381fce10a9616efd471103920. after a delay of 9142
{code}

It looks like there is an RS-level deadlock triggered by the handling of the 
FATAL append exception. In the end, I had to restart the RS service to rescue 
the regions from the stuck RS. The relevant catch block in FSHLog:

{code}
      } catch (Exception e) {
        LOG.fatal("Could not append. Requesting close of wal", e);
        requestLogRoll();
        throw e;
      }
      numEntries.incrementAndGet();
    }
{code}

Maybe the RS should just abort itself after the FATAL exception, since it 
cannot append the WAL to HDFS anyway?
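A minimal sketch of that idea: on a fatal append failure, abort the server so the master can reassign its regions, rather than only requesting a log roll and leaving the RS wedged. All names here (WalAppendSketch, Server, tryAppend) are hypothetical stand-ins modeled loosely on HBase's Abortable pattern, not the actual FSHLog code:

{code}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class WalAppendSketch {
    // Stand-in for HBase's Abortable interface (assumption for illustration).
    interface Abortable {
        void abort(String why, Throwable cause);
        boolean isAborted();
    }

    static class Server implements Abortable {
        private final AtomicBoolean aborted = new AtomicBoolean(false);
        public void abort(String why, Throwable cause) {
            // The real RS would log FATAL and begin shutdown here.
            aborted.set(true);
        }
        public boolean isAborted() { return aborted.get(); }
    }

    // Simulated WAL append that fails the way the report describes.
    static void append(boolean failLikeBadAck) throws IOException {
        if (failLikeBadAck) {
            throw new IOException(
                "Bad connect ack with firstBadLink as 10.28.1.17:50010");
        }
    }

    // Instead of only requesting a log roll, abort on append failure so
    // regions get reassigned instead of the RS hanging.
    static boolean tryAppend(Server server, boolean fail) {
        try {
            append(fail);
            return true;
        } catch (IOException e) {
            server.abort("WAL append failed; aborting to avoid a stuck RS", e);
            return false;
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        System.out.println(tryAppend(server, false) + " " + server.isAborted());
        System.out.println(tryAppend(server, true) + " " + server.isAborted());
    }
}
{code}

The design point is only that a failed WAL append leaves the server in a state it flags loudly (abort) instead of one it silently survives.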



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
