[
https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718051#comment-14718051
]
stack commented on HBASE-14317:
-------------------------------
This is from the log attached to the original complaint:
{code}
2015-08-23 07:22:26,060 FATAL [regionserver/r12s16.sjc.aristanetworks.com/172.24.32.16:9104.append-pool1-t1] wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.24.32.16:10110, 172.24.32.13:10110], original=[172.24.32.16:10110, 172.24.32.13:10110]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:969)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1035)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
{code}
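As an aside, the policy named at the bottom of that exception is a client-side HDFS setting, so it can be tuned from the RegionServer's configuration if one wanted to experiment. A minimal sketch only: the property names are the standard HDFS client ones, the values are illustrative, and none of this addresses the WAL hang itself:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Illustrative only: relax the client-side datanode-replacement policy named in
// the exception so a shrunken pipeline keeps writing instead of erroring out.
Configuration conf = HBaseConfiguration.create();
// NEVER = keep writing to the surviving datanodes rather than demanding a replacement.
conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
// On Hadoop 2.6+ there is also a best-effort knob that keeps DEFAULT semantics but
// carries on when no replacement datanode can be found:
// conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
{code}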
It looks like yours in that the complaint is that we cannot append.
If I manufacture a failed append, I can get a hang. It is this logic in the
finally of HRegion#doMiniBatchMutation ... and probably in all the other places we
do the append/sync dance. At the end of step 5, we do the WAL append, and if we
get an IOE (which is what you have pasted and what is in the original
complaint's log), then we go to the finally:
{code}
} finally {
  // if the wal sync was unsuccessful, remove keys from memstore
  if (doRollBackMemstore) {
    rollbackMemstore(memstoreCells);
  }
  if (w != null) {
    mvcc.completeMemstoreInsertWithSeqNum(w, walKey);
  }
  ...
{code}
The rollback of edits is fine, but w is not null in the above, so we go on to
complete the insert in mvcc, and inside there we ask the walKey for its
sequenceid... which is assigned AFTER we append... only the append failed. So
we wait...
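To make that blocking step concrete, here is a minimal standalone sketch; this is not HBase code, the names (FakeWalKey, StuckSeqIdDemo, etc.) are made up for illustration, and it assumes the sequence id is delivered to the key via a latch-style handoff once the append succeeds (which matches my reading of WALKey, but treat it as an assumption):
{code}
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Standalone sketch of the hang described above: the sequence id is only handed
// to the key by a successful append, so asking the key for it after a failed
// append blocks forever.
public class StuckSeqIdDemo {

  // Stand-in for WALKey: the sequence id arrives via a latch-style handoff.
  static class FakeWalKey {
    private final CountDownLatch assigned = new CountDownLatch(1);
    private volatile long seqId = -1;

    void setLogSeqNum(long id) {      // would be called from the append handler
      seqId = id;
      assigned.countDown();
    }

    long getSequenceId() throws InterruptedException {
      assigned.await();               // never returns if the append died first
      return seqId;
    }
  }

  public static void main(String[] args) throws Exception {
    final FakeWalKey walKey = new FakeWalKey();

    // The "append" fails with an IOE before it ever assigns a sequence id.
    Thread appender = new Thread(new Runnable() {
      @Override
      public void run() {
        try {
          throw new IOException("Failed to replace a bad datanode ...");
          // walKey.setLogSeqNum(...) is never reached.
        } catch (IOException e) {
          System.out.println("append failed: " + e);
        }
      }
    }, "append-pool1-t1");
    appender.start();
    appender.join();

    // Models mvcc.completeMemstoreInsertWithSeqNum(w, walKey) in the finally block:
    // it asks the key for its sequence id and parks indefinitely.
    System.out.println("waiting on walKey.getSequenceId() ...");
    walKey.getSequenceId();
    System.out.println("unreachable in practice");
  }
}
{code}
If that reading is right, anything that makes getSequenceId() fail fast, or assigns a rolled-back sequence id on append failure, would unstick the waiter.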
Let me look a bit more.
I think your patch would break a wait on safe point, but I am not sure it would
unblock all threads. Let me try and manufacture safepoint waiters too. Will be
back.
> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
> Key: HBASE-14317
> URL: https://issues.apache.org/jira/browse/HBASE-14317
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.2.0, 1.1.1
> Reporter: stack
> Priority: Critical
> Attachments: HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead
> DN - Pastebin.com.html, raw.php, subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960), but we get stuck.
> See attached thread dump and associated log. What is interesting is that
> syncers are waiting to take syncs to run, and at the same time we want to flush, so
> we are waiting on a safe point, but there seems to be nothing in our ring
> buffer; did we go to roll the log and not add a safe point sync to clear out the
> ringbuffer?
> Needs a bit of study. Try to reproduce.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)