[jira] [Commented] (HBASE-14317) Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL

stack (JIRA) Fri, 04 Sep 2015 00:43:11 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730452#comment-14730452
 ]


stack commented on HBASE-14317:
-------------------------------

Testing branch-1, I found this little hole. I committed the fix as an addendum 
to the master branch. The patch is included in what I've posted for branch-1.

{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
index 5708c30..c421f5c 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
@@ -878,8 +878,19 @@ public class FSHLog implements WAL {
         // Let the writer thread go regardless, whether error or not.
         if (zigzagLatch != null) {
           zigzagLatch.releaseSafePoint();
-          // It will be null if we failed our wait on safe point above.
-          if (syncFuture != null) blockOnSync(syncFuture);
+          // syncFuture will be null if we failed our wait on safe point above. Otherwise, if
+          // latch was obtained successfully, the sync we threw in either trigger the latch or it
+          // got stamped with an exception because the WAL was damaged and we could not sync. Now
+          // the write pipeline has been opened up again by releasing the safe point, process the
+          // syncFuture we got above. This is probably a noop but it may be stale exception from
+          // when old WAL was in place. Catch it if so.
+          if (syncFuture != null) {
+            try {
+              blockOnSync(syncFuture);
+            } catch (IOException ioe) {
+              if (LOG.isTraceEnabled()) LOG.trace("Stale sync exception", ioe);
+            }
+          }
         }
       } finally {
         scope.close();
{code}
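
For illustration only (this is not part of the patch): a minimal, self-contained sketch of the pattern the addendum applies, namely draining a pending sync future after the safe point has been released and tolerating an exception that was stamped on it while the old WAL was still in place. The class StaleSyncSketch, the method drainAfterRoll, and the use of CompletableFuture are assumptions made for this sketch; FSHLog uses its own SyncFuture and blockOnSync, which are only imitated here.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class StaleSyncSketch {

  // Hypothetical stand-in for FSHLog's blockOnSync: wait for the sync to
  // complete and surface any failure as an IOException.
  static void blockOnSync(CompletableFuture<Long> syncFuture) throws IOException {
    try {
      syncFuture.get();
    } catch (ExecutionException e) {
      throw new IOException("sync failed", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("interrupted waiting on sync", e);
    }
  }

  // Called after the safe point has been released. A null future means the
  // wait on the safe point itself failed, so there is nothing to drain.
  // Returns true if a (presumed stale) exception was caught and swallowed.
  static boolean drainAfterRoll(CompletableFuture<Long> syncFuture) {
    if (syncFuture == null) {
      return false;
    }
    try {
      blockOnSync(syncFuture);
      return false;
    } catch (IOException ioe) {
      // The failure belongs to the old writer; the roll has already replaced
      // it, so note the exception and carry on rather than failing the roll.
      return true;
    }
  }

  public static void main(String[] args) {
    CompletableFuture<Long> ok = CompletableFuture.completedFuture(1L);
    CompletableFuture<Long> stale = new CompletableFuture<>();
    stale.completeExceptionally(new IOException("old WAL damaged"));

    System.out.println("null future swallowed?  " + drainAfterRoll(null));
    System.out.println("ok future swallowed?    " + drainAfterRoll(ok));
    System.out.println("stale future swallowed? " + drainAfterRoll(stale));
  }
}
```

The point of the sketch is only that the exception is caught at the drain site instead of propagating out of the roll; whether swallowing is safe depends on the exception really being stale, which is what the comment in the patch argues.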

> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.0.3, 1.1.3
>
>         Attachments: 14317.branch-1.txt, 14317.test.txt, 14317v10.txt, 
> 14317v11.txt, 14317v12.txt, 14317v13.txt, 14317v14.txt, 14317v15.txt, 
> 14317v5.branch-1.2.txt, 14317v5.txt, 14317v9.txt, HBASE-14317-v1.patch, 
> HBASE-14317-v2.patch, HBASE-14317-v3.patch, HBASE-14317-v4.patch, 
> HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN - 
> Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, san_dump.txt, 
> subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960), but we get stuck. 
> See the attached thread dump and associated log. What is interesting is that 
> the syncers are waiting to take syncs to run and at the same time we want to 
> flush, so we are waiting on a safe point, but there seems to be nothing in our 
> ring buffer; did we go to roll the log and not add a safe point sync to clear 
> out the ring buffer?
> Needs a bit of study. Try to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
