[ 
https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725888#comment-14725888
 ] 

stack commented on HBASE-14317:
-------------------------------

Trying this out as a fix for the hang: i.e. we can fall into the wait on 
zigzagLatch even though all outstanding appends and syncs are failing and will 
never complete (and so will never advance the highest synced sequence up to the 
current sequence id).

{code}
@@ -1792,9 +1797,10 @@ public class FSHLog implements WAL {
       // If here, another thread is waiting on us to get to safe point.  Don't leave it hanging.
       try {
         // Wait on outstanding syncers; wait for them to finish syncing (unless we've been
-        // shutdown or unless our latch has been thrown because we have been aborted).
+        // shutdown or unless our latch has been thrown because we have been aborted or unless
+        // this WAL is broken and we can't get a sync/append to complete).
         while (!this.shutdown && this.zigzagLatch.isCocked() &&
-            highestSyncedSequence.get() < currentSequence) {
+            highestSyncedSequence.get() < currentSequence && this.syncFuturesCount > 0) {
{code}
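
To make the effect of the extra clause concrete, here is a rough, standalone sketch of just the loop condition. It is not the real FSHLog: the class SafePointWaitSketch, the latchCocked flag (standing in for zigzagLatch.isCocked()), and the two shouldWait* helpers are hypothetical names made up for illustration. It also assumes that syncFuturesCount falls to zero once there are no outstanding sync futures left to wait on, which is my reading of the patch comment rather than something verified against the code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified model of the wait condition in the diff above.
// Only the boolean condition is modeled; there is no ring buffer or real latch.
public class SafePointWaitSketch {
    final AtomicLong highestSyncedSequence = new AtomicLong(0);
    volatile boolean shutdown = false;
    volatile boolean latchCocked = true; // stand-in for zigzagLatch.isCocked()
    volatile int syncFuturesCount = 0;   // assumed: number of outstanding sync futures

    // Condition before the patch: wait until syncs catch up to currentSequence.
    boolean shouldWaitOld(long currentSequence) {
        return !shutdown && latchCocked
            && highestSyncedSequence.get() < currentSequence;
    }

    // Condition after the patch: also stop waiting when nothing is outstanding.
    boolean shouldWaitPatched(long currentSequence) {
        return shouldWaitOld(currentSequence) && syncFuturesCount > 0;
    }

    public static void main(String[] args) {
        SafePointWaitSketch wal = new SafePointWaitSketch();
        // Broken WAL: syncs stalled at 5, appends handed out up to 10,
        // and no sync futures left that could ever complete.
        wal.highestSyncedSequence.set(5);
        long currentSequence = 10;
        wal.syncFuturesCount = 0;

        System.out.println("old condition keeps waiting: "
            + wal.shouldWaitOld(currentSequence));
        System.out.println("patched condition keeps waiting: "
            + wal.shouldWaitPatched(currentSequence));
    }
}
```

Under those assumptions the old condition stays true forever once the synced sequence stops advancing, while the patched one lets the loop exit so the thread waiting on the safe point is released.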

Will be back on the highly unlikely but possible case where an append fails but 
the sync does not (a sync may be ongoing at the time of the append and may 
'finish' 'successfully' after the append, so... let me see).

> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Priority: Critical
>         Attachments: 14317.test.txt, HBASE-14317-v1.patch, 
> HBASE-14317-v2.patch, HBASE-14317-v3.patch, HBASE-14317-v4.patch, 
> HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN - 
> Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, san_dump.txt, 
> subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960) but we get stuck. 
> See attached thread dump and associated log. What is interesting is that 
> syncers are waiting to take syncs to run and at same time we want to flush so 
> we are waiting on a safe point but there seems to be nothing in our ring 
> buffer; did we go to roll log and not add safe point sync to clear out 
> ringbuffer?
> Needs a bit of study. Try to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
