[
https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718051#comment-14718051
]
stack commented on HBASE-14317:
-------------------------------
This is from the log attached to the original complaint:
{code}
2015-08-23 07:22:26,060 FATAL [regionserver/r12s16.sjc.aristanetworks.com/172.24.32.16:9104.append-pool1-t1] wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.24.32.16:10110, 172.24.32.13:10110], original=[172.24.32.16:10110, 172.24.32.13:10110]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:969)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1035)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
{code}
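As an aside, the policy named at the bottom of that exception is a client-side HDFS setting, so it can be tuned from the RegionServer's configuration if one wanted to experiment. A minimal sketch only: the property names are the standard HDFS client ones, the values are illustrative, and none of this addresses the WAL hang itself:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Illustrative only: relax the client-side datanode-replacement policy named in
// the exception so a shrunken pipeline keeps writing instead of erroring out.
Configuration conf = HBaseConfiguration.create();
// NEVER = keep writing to the surviving datanodes rather than demanding a replacement.
conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
// On Hadoop 2.6+ there is also a best-effort knob that keeps DEFAULT semantics but
// carries on when no replacement datanode can be found:
// conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
{code}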
It looks like yours in that the complaint is that we cannot append.
If I manufacture a failed append, I can get a hang. It is this logic in the
finally of HRegion#doMiniBatchMutation ... and probably in all the other places we
do the append/sync dance. At the end of step 5, we do the WAL append, and if we
get an IOE (which is what you have pasted and what is in the original
complaint's log), then we go to the finally:
{code}
} finally {
  // if the wal sync was unsuccessful, remove keys from memstore
  if (doRollBackMemstore) {
    rollbackMemstore(memstoreCells);
  }
  if (w != null) {
    mvcc.completeMemstoreInsertWithSeqNum(w, walKey);
  }
  ...
{code}
The rollback of edits is fine, but w is not null in the above, so we go on to
complete the insert in mvcc, and inside there we ask the walKey for its
sequenceid... which is assigned AFTER we append... only the append failed. So
we wait...
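To make that blocking step concrete, here is a minimal standalone sketch; this is not HBase code, the names (FakeWalKey, StuckSeqIdDemo, etc.) are made up for illustration, and it assumes the sequence id is delivered to the key via a latch-style handoff once the append succeeds (which matches my reading of WALKey, but treat it as an assumption):
{code}
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Standalone sketch of the hang described above: the sequence id is only handed
// to the key by a successful append, so asking the key for it after a failed
// append blocks forever.
public class StuckSeqIdDemo {

  // Stand-in for WALKey: the sequence id arrives via a latch-style handoff.
  static class FakeWalKey {
    private final CountDownLatch assigned = new CountDownLatch(1);
    private volatile long seqId = -1;

    void setLogSeqNum(long id) {      // would be called from the append handler
      seqId = id;
      assigned.countDown();
    }

    long getSequenceId() throws InterruptedException {
      assigned.await();               // never returns if the append died first
      return seqId;
    }
  }

  public static void main(String[] args) throws Exception {
    final FakeWalKey walKey = new FakeWalKey();

    // The "append" fails with an IOE before it ever assigns a sequence id.
    Thread appender = new Thread(new Runnable() {
      @Override
      public void run() {
        try {
          throw new IOException("Failed to replace a bad datanode ...");
          // walKey.setLogSeqNum(...) is never reached.
        } catch (IOException e) {
          System.out.println("append failed: " + e);
        }
      }
    }, "append-pool1-t1");
    appender.start();
    appender.join();

    // Models mvcc.completeMemstoreInsertWithSeqNum(w, walKey) in the finally block:
    // it asks the key for its sequence id and parks indefinitely.
    System.out.println("waiting on walKey.getSequenceId() ...");
    walKey.getSequenceId();
    System.out.println("unreachable in practice");
  }
}
{code}
If that reading is right, anything that makes getSequenceId() fail fast, or assigns a rolled-back sequence id on append failure, would unstick the waiter.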
Let me look a bit more.
I think your patch would break a wait on safe point, but I am not sure it would
unblock all threads. Let me try and manufacture safepoint waiters too. Will be
back.
> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
> Key: HBASE-14317
> URL: https://issues.apache.org/jira/browse/HBASE-14317
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.2.0, 1.1.1
> Reporter: stack
> Priority: Critical
> Attachments: HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead
> DN - Pastebin.com.html, raw.php, subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960), but we get stuck.
> See attached thread dump and associated log. What is interesting is that
> syncers are waiting to take syncs to run, and at the same time we want to flush, so
> we are waiting on a safe point, but there seems to be nothing in our ring
> buffer; did we go to roll the log and not add a safe point sync to clear out the
> ringbuffer?
> Needs a bit of study. Try to reproduce.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)