[jira] [Updated] (HBASE-14317) Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL

stack (JIRA) Tue, 01 Sep 2015 16:05:07 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack updated HBASE-14317:
--------------------------
    Attachment: 14317v5.txt

Thanks for input [~eclark]

On your first comment, I think the fact that the consumer is single-threaded 
lets us reason about whose state we can stamp on (though syncs run elsewhere 
on their own threads). I think I agree with your second comment.

Here is a patch that throws an exception if the append fails, even if the sync 
succeeds (in fact, anything after a failed append will fail until the WAL is 
replaced). It also fixes the lockup and reverts HBASE-13971. I will work some 
more on making the tests more stringent.
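
A minimal sketch of the "failed append poisons everything until the WAL is 
replaced" behavior, for illustration only. StickyFailureWal, its nested Writer 
interface, and roll() are hypothetical names and are not the FSHLog code in the 
attached patch; the sketch assumes a single lock in place of the ring buffer 
and sync threads.

```java
import java.io.IOException;

/** Hypothetical stand-in for a WAL; NOT the real FSHLog. */
class StickyFailureWal {
    /** Hypothetical writer abstraction over the underlying file. */
    interface Writer {
        void append(String edit) throws IOException;
        void sync() throws IOException;
    }

    private Writer writer;
    // First append failure seen on the current writer; null while healthy.
    private IOException appendFailure;

    StickyFailureWal(Writer writer) {
        this.writer = writer;
    }

    /** Appends an edit; after one failure, rethrows that same exception. */
    synchronized void append(String edit) throws IOException {
        if (appendFailure != null) {
            throw appendFailure;
        }
        try {
            writer.append(edit);
        } catch (IOException e) {
            appendFailure = e; // remember it until the writer is replaced
            throw e;
        }
    }

    /** Fails the sync if any earlier append failed, even if the sync itself could succeed. */
    synchronized void sync() throws IOException {
        if (appendFailure != null) {
            throw appendFailure;
        }
        writer.sync();
    }

    /** Replaces the writer (a "roll") and clears the remembered failure. */
    synchronized void roll(Writer newWriter) {
        writer = newWriter;
        appendFailure = null;
    }
}
```

In this sketch a caller that sees the exception from append() or sync() is 
expected to request a roll; only roll() makes the log usable again.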

+ Adds to mvcc a new cancelMemstoreInsert that removes the entry from the queue 
and does NOT advance the read point (w/o this change, we were trying to 
complete the memstore insert but the sequenceid was far in excess of the last 
successful sync, especially on failure, so we'd get stuck).
+ In FSHLog, keep around the exception thrown when appending. Throw the same 
exception for all subsequent appends. Fail syncs too. Do this until the WAL has 
been changed out from under us. Changed the wait on the zigzaglatch so it 
checks whether there are outstanding syncs. There may be none if syncs just 
fail. We need this to break out of the loop in the case where syncs are failing 
and are NOT going to raise the sequence id beyond where we want it.
+ TestHRegion: added tests for both conditions (Elliott did the append test 
stuff).
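
To make the difference between completing and cancelling a memstore insert 
concrete, here is a rough sketch. SimpleMvcc and its fields are hypothetical 
and much simpler than the real mvcc class; in particular, write numbers are 
assigned up front here, which is an assumption of the sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified MVCC; NOT the real HBase implementation. */
class SimpleMvcc {
    static final class WriteEntry {
        final long writeNumber;
        boolean completed;

        WriteEntry(long writeNumber) {
            this.writeNumber = writeNumber;
        }
    }

    private final Deque<WriteEntry> writeQueue = new ArrayDeque<>();
    private long readPoint = 0;
    private long nextWriteNumber = 1;

    /** Registers a pending write at the tail of the queue. */
    synchronized WriteEntry beginMemstoreInsert() {
        WriteEntry e = new WriteEntry(nextWriteNumber++);
        writeQueue.addLast(e);
        return e;
    }

    /** Marks the write done and advances the read point over completed entries at the head. */
    synchronized void completeMemstoreInsert(WriteEntry e) {
        e.completed = true;
        while (!writeQueue.isEmpty() && writeQueue.peekFirst().completed) {
            readPoint = writeQueue.pollFirst().writeNumber;
        }
    }

    /** Drops a failed write from the queue WITHOUT touching the read point. */
    synchronized void cancelMemstoreInsert(WriteEntry e) {
        writeQueue.remove(e);
    }

    synchronized long getReadPoint() {
        return readPoint;
    }
}
```

With this shape, a write whose append or sync failed is never made visible to 
readers, and it no longer blocks later writes from advancing the read point 
once they complete.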


> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.0.3, 1.1.3
>
>         Attachments: 14317.test.txt, 14317v5.txt, HBASE-14317-v1.patch, 
> HBASE-14317-v2.patch, HBASE-14317-v3.patch, HBASE-14317-v4.patch, 
> HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN - 
> Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, san_dump.txt, 
> subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because can't append (See HDFS-8960) but we get stuck. 
> See attached thread dump and associated log. What is interesting is that 
> syncers are waiting to take syncs to run and at same time we want to flush so 
> we are waiting on a safe point but there seems to be nothing in our ring 
> buffer; did we go to roll log and not add safe point sync to clear out 
> ringbuffer?
> Needs a bit of study. Try to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
