[jira] [Commented] (HBASE-14317) Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL

stack (JIRA) Thu, 03 Sep 2015 11:37:43 -0700

    [ https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729538#comment-14729538 ]


stack commented on HBASE-14317:
-------------------------------

Ran on a small cluster (1B ITBLL with monkeys; confirmed all data was there). 
Checked the logs. No hangs and no complaints related to this patch. Just the 
usual complaints about slow HDFS, including stuff like this:

2015-09-02 23:56:52,790 WARN  [regionserver/c2023.halxg.cloudera.com/10.20.84.29:16020.logRoller] hdfs.DFSClient: Slow waitForAckedSeqno took 2577ms (threshold=20ms)

Also DFS client complaints and exceptions, but nothing from the RS or related 
to the WAL.
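
For anyone wanting to pull these slow-ack warnings out of a RegionServer log, 
here is a minimal sketch. It assumes the message format is exactly as in the 
line quoted above; other Hadoop versions may word it differently. The class 
name SlowAckScan and the method slowAckMillis are made up for illustration and 
are not part of HBase or HDFS.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlowAckScan {
    // Pattern taken from the single WARN line quoted above (an assumption
    // about the format, not a documented contract of DFSClient).
    private static final Pattern SLOW_ACK = Pattern.compile(
        "Slow waitForAckedSeqno took (\\d+)ms \\(threshold=(\\d+)ms\\)");

    // Returns the reported ack wait in milliseconds, or empty if the line
    // is not a slow-ack warning.
    static OptionalLong slowAckMillis(String line) {
        Matcher m = SLOW_ACK.matcher(line);
        return m.find()
            ? OptionalLong.of(Long.parseLong(m.group(1)))
            : OptionalLong.empty();
    }

    public static void main(String[] args) {
        String line = "2015-09-02 23:56:52,790 WARN  "
            + "[regionserver/c2023.halxg.cloudera.com/10.20.84.29:16020.logRoller] "
            + "hdfs.DFSClient: Slow waitForAckedSeqno took 2577ms (threshold=20ms)";
        slowAckMillis(line).ifPresent(ms ->
            System.out.println("slow ack: " + ms + "ms"));
    }
}
```

Feeding each log line through slowAckMillis and keeping the non-empty results 
gives a quick picture of how slow HDFS was during the run.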

Looking at the failed test: on the one hand, the lease was just robbed on all 
WALs out from under the cluster. Let me make sure the failure is because of the 
stricter semantics and not some other byproduct. Looking at it, we should be 
able to ride over the HDFS restart. Will be back.


> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.0.3, 1.1.3
>
>         Attachments: 14317.test.txt, 14317v10.txt, 14317v11.txt, 
> 14317v12.txt, 14317v13.txt, 14317v5.branch-1.2.txt, 14317v5.txt, 14317v9.txt, 
> HBASE-14317-v1.patch, HBASE-14317-v2.patch, HBASE-14317-v3.patch, 
> HBASE-14317-v4.patch, HBASE-14317.patch, [Java] RS stuck on WAL sync to a 
> dead DN - Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, 
> san_dump.txt, subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960) but we get stuck. 
> See the attached thread dump and associated log. What is interesting is that 
> syncers are waiting to take syncs to run and at the same time we want to 
> flush, so we are waiting on a safe point, but there seems to be nothing in our 
> ring buffer; did we go to roll the log and not add a safe point sync to clear 
> out the ring buffer?
> Needs a bit of study. Try to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
