[jira] [Updated] (HBASE-14317) Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL

stack (JIRA) Wed, 02 Sep 2015 14:35:35 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack updated HBASE-14317:
--------------------------
        Assignee: stack
    Hadoop Flags: Reviewed
    Release Note: 
Tighten up WAL-use semantics.

1. If an append or a sync throws an exception, all subsequent attempts at using 
the log will also throw this same exception. The WAL is now a lame-duck until 
you roll it.
2. If an append succeeds and we then fail to sync it, this is a fatal 
exception. The container must abort to replay the WAL logs even though we have 
told the client that the appends failed.

The above rules have been applied laxly up to now; it used to be possible to 
get a good sync to go in over the top of a failed append. This patch fixes 
that.
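
As an illustration only, the two rules can be sketched as a toy Java class. 
This is not the FSHLog code from the patch: LameDuckWal, Writer, 
FatalSyncException and roll(Writer) are hypothetical names made up for this 
sketch, and the real WAL runs appends and syncs through a ring buffer rather 
than plain synchronized methods.

```java
import java.io.IOException;

/** Hypothetical toy WAL showing the lame-duck rules; not HBase's FSHLog. */
public class LameDuckWal {
    /** Rule 2: a sync failed after a successful append; the host should abort. */
    public static class FatalSyncException extends RuntimeException {
        public FatalSyncException(Throwable cause) { super(cause); }
    }

    /** Hypothetical stand-in for the underlying log file writer. */
    public interface Writer {
        void append(String edit) throws IOException;
        void sync() throws IOException;
    }

    private Writer writer;
    private IOException sticky;       // first failure seen; null while healthy
    private boolean unsyncedAppends;  // an append succeeded but is not yet synced

    public LameDuckWal(Writer writer) { this.writer = writer; }

    public synchronized void append(String edit) throws IOException {
        if (sticky != null) throw sticky;  // rule 1: keep throwing the same exception
        try {
            writer.append(edit);
            unsyncedAppends = true;
        } catch (IOException e) {
            sticky = e;                    // WAL is a lame duck from here on
            throw e;
        }
    }

    public synchronized void sync() throws IOException {
        if (sticky != null) throw sticky;  // no good sync over a failed append
        try {
            writer.sync();
            unsyncedAppends = false;
        } catch (IOException e) {
            sticky = e;
            if (unsyncedAppends) throw new FatalSyncException(e);  // rule 2
            throw e;
        }
    }

    /** Rolling onto a fresh writer is the only way to clear the lame-duck state. */
    public synchronized void roll(Writer fresh) {
        writer = fresh;
        sticky = null;
        unsyncedAppends = false;
    }
}
```

In this sketch a caller that sees FatalSyncException is expected to abort so 
the log gets replayed; any other IOException just means the log has to be 
rolled before it can be used again.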

Also fixed a hang in the WAL subsystem that occurred if a request to pause the 
write pipeline ran into a failed sync before the roll request's sync got 
scheduled.


TODO: Revisit our WAL system. HBASE-12751 helps rationalize our write pipeline. 
In particular, it manages sequenceid inside mvcc, which should let us purge the 
mechanism that writes empty, unflushed appends just to get the next 
sequenceid; that mechanism is problematic when the WAL goes lame-duck. Let's 
get it in.
TODO: A successful append followed by a failed sync probably only needs us to 
replace the WAL (if we have signalled the client that the appends failed). 
The bummer is that, when replicating, these last appends might make it to the 
sink cluster or get replayed during recovery. Should HBase keep its own WAL 
length? Or should the sequenceid of the last successful sync be passed when 
doing recovery and replication?

> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.0.3, 1.1.3
>
>         Attachments: 14317.test.txt, 14317v10.txt, 14317v11.txt, 
> 14317v5.branch-1.2.txt, 14317v5.txt, 14317v9.txt, HBASE-14317-v1.patch, 
> HBASE-14317-v2.patch, HBASE-14317-v3.patch, HBASE-14317-v4.patch, 
> HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN - 
> Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, san_dump.txt, 
> subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960) but we get stuck. 
> See attached thread dump and associated log. What is interesting is that 
> syncers are waiting to take syncs to run and at the same time we want to 
> flush, so we are waiting on a safe point, but there seems to be nothing in 
> our ring buffer; did we go to roll the log and not add a safe point sync to 
> clear out the ring buffer?
> Needs a bit of study. Try to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
