[ https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-14317:
--------------------------
Attachment: 14317v5.txt
Thanks for the input, [~eclark].
On your first comment, I think the fact that the consumer is single-threaded
lets us reason about who we can stamp on (though syncs run elsewhere on their
own threads), and I think I agree with your second comment.
Here is a patch that throws an exception if the append fails, even if the sync
succeeds (in fact, anything after a failed append will fail until the WAL is
replaced). It also fixes the lockup and reverts HBASE-13971. I'll work on it
some more to make the tests more stringent.
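The append-failure behavior described above can be sketched as a small model. This is a minimal sketch only, assuming a hypothetical `StickyFailureWal` class and `Writer` interface (not FSHLog's actual API): the first append exception is remembered, rethrown for every later append and sync, and cleared only when the writer is replaced, which stands in for a log roll.

```java
import java.io.IOException;

/**
 * Hypothetical, simplified model of a WAL that remembers the first append
 * failure and fails every later append and sync until the writer is replaced.
 * Not the real FSHLog API.
 */
public class StickyFailureWal {
    /** Hypothetical stand-in for the underlying HDFS output stream. */
    public interface Writer {
        void append(String entry) throws IOException;
        void sync() throws IOException;
    }

    private Writer writer;
    private IOException appendFailure; // set by the first failed append, cleared on roll

    public StickyFailureWal(Writer writer) {
        this.writer = writer;
    }

    public synchronized void append(String entry) throws IOException {
        if (appendFailure != null) {
            throw appendFailure; // same exception for all subsequent appends
        }
        try {
            writer.append(entry);
        } catch (IOException e) {
            appendFailure = e; // remember it until the WAL is replaced
            throw e;
        }
    }

    public synchronized void sync() throws IOException {
        if (appendFailure != null) {
            // Fail the sync even if the stream could sync: an edit was lost.
            throw appendFailure;
        }
        writer.sync();
    }

    /** Swapping in a new writer (a log roll in this model) clears the failure. */
    public synchronized void replaceWriter(Writer newWriter) {
        this.writer = newWriter;
        this.appendFailure = null;
    }
}
```

Failing the sync as well matters because a sync that succeeds after a failed append would otherwise report the batch as durable even though one of its edits never reached the log.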
+ Adds a new cancelMemstoreInsert to mvcc that removes the entry from the queue
and does NOT advance the read point. Without this change, we were trying to
complete the memstore insert even though its sequenceid was far in excess of
the last successful sync (especially on failure), so we'd get stuck.
+ In FSHLog, keep the exception thrown when appending and throw that same
exception for all subsequent appends. Fail syncs too. Keep doing this until
the WAL has been changed out from under us. Changed the wait on the zigzaglatch
so it checks whether there are outstanding syncs; there may be none if syncs
just fail. We need this to break out of the loop when syncs are failing and
are NOT going to advance the sequence id far enough for us to break out.
+ TestHRegion: added tests for both conditions (Elliott did the append test
work).
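To see why cancelling a memstore insert matters, here is a minimal model of an MVCC write queue. It assumes a hypothetical `SimpleMvcc` class whose `cancel` method plays the role of the cancelMemstoreInsert described above; the real HBase mvcc differs in detail. The read point only advances over a contiguous prefix of completed writes, so an entry that is neither completed nor removed blocks every write queued behind it.

```java
import java.util.ArrayDeque;

/**
 * Hypothetical, simplified MVCC: writes queue in sequence-id order and the
 * read point advances only over a contiguous prefix of completed writes.
 */
public class SimpleMvcc {
    /** One in-flight memstore insert. */
    public static final class WriteEntry {
        final long seqId;
        boolean completed;

        WriteEntry(long seqId) {
            this.seqId = seqId;
        }
    }

    private final ArrayDeque<WriteEntry> queue = new ArrayDeque<>();
    private long nextSeqId = 1;
    private long readPoint = 0;

    public synchronized WriteEntry begin() {
        WriteEntry e = new WriteEntry(nextSeqId++);
        queue.addLast(e);
        return e;
    }

    /** Marks the write done and advances the read point over any finished prefix. */
    public synchronized void complete(WriteEntry e) {
        e.completed = true;
        while (!queue.isEmpty() && queue.peekFirst().completed) {
            readPoint = queue.pollFirst().seqId;
        }
    }

    /**
     * Analogue of cancelMemstoreInsert: drop the entry (matched by identity)
     * without completing it and without touching the read point.
     */
    public synchronized void cancel(WriteEntry e) {
        queue.remove(e);
    }

    public synchronized long getReadPoint() {
        return readPoint;
    }
}
```

If `b` in the test below were left in the queue instead of cancelled, completing `c` would leave the read point stuck at 1, which is the kind of hang the cancel avoids.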
> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
> Key: HBASE-14317
> URL: https://issues.apache.org/jira/browse/HBASE-14317
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.2.0, 1.1.1
> Reporter: stack
> Priority: Blocker
> Fix For: 2.0.0, 1.2.0, 1.0.3, 1.1.3
>
> Attachments: 14317.test.txt, 14317v5.txt, HBASE-14317-v1.patch,
> HBASE-14317-v2.patch, HBASE-14317-v3.patch, HBASE-14317-v4.patch,
> HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN -
> Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, san_dump.txt,
> subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll the log because we can't append (see HDFS-8960), but we get stuck.
> See the attached thread dump and associated log. What is interesting is that
> the syncers are waiting to take syncs to run while, at the same time, we want
> to flush, so we are waiting on a safe point; yet there seems to be nothing in
> our ring buffer. Did we go to roll the log and not add a safe point sync to
> clear out the ring buffer?
> Needs a bit of study. Try to reproduce.
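The stuck roll in the report comes down to a wait for a sequence id that failing syncs never deliver. The comment at the top describes changing the zigzaglatch wait so it also checks for outstanding syncs. A minimal sketch of that idea follows, assuming a hypothetical `SafePointWaiter` class (not FSHLog or its actual latch): the waiter stops waiting once no syncs remain in flight, rethrowing the sync failure if there was one, instead of blocking forever.

```java
import java.io.IOException;

/**
 * Hypothetical model of a log roller waiting for a safe point. Not the real
 * FSHLog zigzag latch; syncs are tracked with simple counters.
 */
public class SafePointWaiter {
    private long highestSyncedSeqId = 0;
    private int outstandingSyncs = 0;
    private IOException syncFailure; // last sync failure seen, if any

    public synchronized void syncStarted() {
        outstandingSyncs++;
    }

    public synchronized void syncSucceeded(long seqId) {
        outstandingSyncs--;
        highestSyncedSeqId = Math.max(highestSyncedSeqId, seqId);
        notifyAll(); // wake waiters so they re-check their condition
    }

    public synchronized void syncFailed(IOException e) {
        outstandingSyncs--;
        syncFailure = e;
        notifyAll();
    }

    /**
     * Returns true once seqId has been synced. If no syncs are in flight, the
     * seqId cannot advance, so break out: rethrow a recorded failure, else return false.
     */
    public synchronized boolean awaitSafePoint(long seqId)
            throws IOException, InterruptedException {
        while (highestSyncedSeqId < seqId) {
            if (outstandingSyncs == 0) {
                if (syncFailure != null) {
                    throw syncFailure;
                }
                return false;
            }
            wait(); // released by syncSucceeded or syncFailed
        }
        return true;
    }
}
```

Checking the outstanding-sync count inside the loop is what lets the waiter escape when every sync fails; waiting on the sequence id alone would block indefinitely in that case.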
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)