[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122682#comment-14122682 ]
Victor Xu commented on HBASE-11902:
-----------------------------------
Yes, stack. The regionserver main thread is waiting in
org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps, but the root cause
of the abort is DataNode failures. You can find the details in the log (a
sketch of why stopAndDrainOps hangs follows the excerpt):
2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncSyncer sync, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,801 ERROR org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException while writing trailer
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,802 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,802 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Riding over HLog close failure! error count=1
2014-09-03 13:38:03,804 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Rolled WAL /hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409722420708 with entries=32565, filesize=118.6 M; new WAL /hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409722683780
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707475254
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707722202
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707946159
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409708155788
2014-09-03 13:38:03,839 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:03,839 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c., current region memstore size 218.5 M
2014-09-03 13:38:03,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:03,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c., current region memstore size 218.5 M
2014-09-03 13:38:03,897 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:04,699 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: One or more threads are no longer alive -- stop
2014-09-03 13:38:04,699 INFO org.apache.hadoop.ipc.RpcServer: Stopping server on 60020
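For context, here is a minimal sketch of the drain-barrier pattern. It is a
simplified illustration, not the actual org.apache.hadoop.hbase.util.DrainBarrier
source: stopAndDrainOps waits until every in-flight operation has signalled
completion, so a WAL sync that is stuck on a dead pipeline and never signals
completion blocks the waiter forever.
{code}
// Minimal sketch of the drain-barrier pattern (simplified illustration,
// not the real org.apache.hadoop.hbase.util.DrainBarrier implementation).
public final class DrainBarrierSketch {
  private boolean stopped = false;
  private long inFlightOps = 0;

  // A WAL operation (e.g. a sync) enters the barrier before it starts.
  public synchronized boolean beginOp() {
    if (stopped) return false;   // refuse new ops once stopping
    inFlightOps++;
    return true;
  }

  // ...and must leave it when it finishes. An op stuck on a dead DFS
  // pipeline never reaches this call.
  public synchronized void endOp() {
    if (--inFlightOps == 0) notifyAll();
  }

  // This is where the regionserver main thread parks while closing the WAL.
  public synchronized void stopAndDrainOps() throws InterruptedException {
    stopped = true;
    while (inFlightOps > 0) {
      wait();                    // blocks forever if endOp() never comes
    }
  }
}
{code}
That matches the jstack: the outstanding WAL operation never completes after
the pipeline dies, so stopAndDrainOps in the main thread never returns.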
> RegionServer was blocked while aborting
> ---------------------------------------
>
> Key: HBASE-11902
> URL: https://issues.apache.org/jira/browse/HBASE-11902
> Project: HBase
> Issue Type: Bug
> Components: regionserver, wal
> Affects Versions: 0.98.4
> Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
> Reporter: Victor Xu
> Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log,
> jstack_hadoop461.cm6.log
>
>
> Generally, the regionserver automatically aborts when isHealthy() returns
> false. But it sometimes gets blocked while aborting. I saved the jstack and
> the logs, and found out that the hang was caused by DataNode failures: the
> "regionserver60020" thread was blocked while closing the WAL.
> This issue doesn't happen very often, but when it does, it always causes a
> huge number of failed requests. The only way out is kill -9.
> I think it's a bug, but I haven't found a decent solution. Does anyone have
> the same problem?
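One possible direction, sketched below purely as an illustration (the class,
method, and timeout below are hypothetical, not a tested patch): during abort,
run the WAL close on a helper thread and bound it with a timeout so a dead DFS
pipeline cannot pin the regionserver main thread indefinitely.
{code}
// Hypothetical mitigation sketch, not HBase code: bound the WAL close during
// abort so a stuck DFS pipeline cannot block the regionserver main thread.
import java.io.Closeable;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedWalClose {
  static void closeWithTimeout(final Closeable wal, long timeoutMs) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<Void> f = executor.submit(new Callable<Void>() {
        public Void call() throws Exception {
          wal.close();  // may hang on "All datanodes ... are bad. Aborting..."
          return null;
        }
      });
      f.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Give up on the stuck stream; edits would be recovered by WAL replay
      // after the usual HDFS lease recovery.
    } catch (Exception e) {
      // Close failed outright; the abort proceeds either way.
    } finally {
      executor.shutdownNow();  // interrupt the stranded close attempt
    }
  }
}
{code}
Whether it is actually safe to abandon the close like this is exactly the open
question here, so treat the sketch as a discussion aid rather than a fix.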
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)