[
https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-14317:
--------------------------
Attachment: repro.txt
Repro of the hang seen in the original attachment raw.php. We cannot replace
the log because we are waiting on the zigzag latch. We cannot close the region
because we are waiting on a flush. Flushes cannot progress because they are
waiting on their sequenceid. The test, testLockedUpWALSystem, times out after
60 seconds of hang. The test is ugly because it has to stand up a region and a
log roller... mostly boilerplate. The patch also reverts HBASE-13971, which only
confuses, and adds a method to FSHLog so I can hold processing around zigzag
latch creation.
The hang happens if a sync comes off the ring buffer AFTER we've created a
SafePointZigZagLatch but BEFORE we have published the replaceWriter zigzag sync
on to the ring buffer, i.e., if the sync comes in AFTER line #794 but BEFORE
line #806 in the listing below. The very existence of the latch object means
the ring buffer consuming thread will fall into the attain safe point code
block, even if the interrupting sync just throws an exception.
{code}
784   Path replaceWriter(final Path oldPath, final Path newPath, Writer nextWriter,
785       final FSDataOutputStream nextHdfsOut)
786   throws IOException {
787     // Ask the ring buffer writer to pause at a safe point. Once we do this, the writer
788     // thread will eventually pause. An error hereafter needs to release the writer thread
789     // regardless -- hence the finally block below. Note, this method is called from the FSHLog
790     // constructor BEFORE the ring buffer is set running so it is null on first time through
791     // here; allow for that.
792     SyncFuture syncFuture = null;
793     SafePointZigZagLatch zigzagLatch = (this.ringBufferEventHandler == null)?
794       null: this.ringBufferEventHandler.attainSafePoint();
795     afterZigZagLatch();
796     TraceScope scope = Trace.startSpan("FSHFile.replaceWriter");
797     try {
798       // Wait on the safe point to be achieved. Send in a sync in case nothing has hit the
799       // ring buffer between the above notification of writer that we want it to go to
800       // 'safe point' and then here where we are waiting on it to attain safe point. Use
801       // 'sendSync' instead of 'sync' because we do not want this thread to block waiting on it
802       // to come back. Cleanup this syncFuture down below after we are ready to run again.
803       try {
804         if (zigzagLatch != null) {
805           Trace.addTimelineAnnotation("awaiting safepoint");
806           syncFuture = zigzagLatch.waitSafePoint(publishSyncOnRingBuffer());
...{code}
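To make the handshake easier to picture, here is a minimal, self-contained sketch of a two-latch "zigzag" built on java.util.concurrent.CountDownLatch. It is a simplified stand-in, not the FSHLog code: the class ZigZagSketch, the nested ZigZagLatch, and the helper methods are hypothetical names, and there is no ring buffer or SyncFuture here. It only shows that the consumer thread parks as soon as it enters the safe point and stays parked until the roller side releases it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified model of a two-latch "zigzag" handshake. All names are
// hypothetical; this is NOT the HBase SafePointZigZagLatch implementation.
final class ZigZagSketch {

    static final class ZigZagLatch {
        private final CountDownLatch attained = new CountDownLatch(1);
        private final CountDownLatch released = new CountDownLatch(1);

        // Roller side: bounded wait until the consumer reports it is parked.
        boolean waitSafePoint(long timeout, TimeUnit unit) throws InterruptedException {
            return attained.await(timeout, unit);
        }

        // Consumer side: report "parked", then block until the roller releases us.
        void safePointAttained() throws InterruptedException {
            attained.countDown();
            released.await();
        }

        // Roller side: let the parked consumer run again.
        void releaseSafePoint() {
            released.countDown();
        }

        // True only while neither side of the handshake has fired.
        boolean isCocked() {
            return attained.getCount() > 0 && released.getCount() > 0;
        }
    }

    // Starts a stand-in for the ring buffer consuming thread.
    static Thread startConsumer(ZigZagLatch latch) {
        Thread t = new Thread(() -> {
            try {
                latch.safePointAttained();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    // One complete handshake: consumer parks, roller observes it, roller releases.
    static boolean runHandshake() {
        try {
            ZigZagLatch latch = new ZigZagLatch();
            Thread consumer = startConsumer(latch);
            boolean attained = latch.waitSafePoint(5, TimeUnit.SECONDS);
            // A real roller would swap the writer here.
            latch.releaseSafePoint();
            consumer.join(5000);
            return attained && !consumer.isAlive() && !latch.isCocked();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // The consumer stays blocked for as long as nobody calls releaseSafePoint().
    static boolean consumerStaysParkedUntilReleased() {
        try {
            ZigZagLatch latch = new ZigZagLatch();
            Thread consumer = startConsumer(latch);
            boolean attained = latch.waitSafePoint(5, TimeUnit.SECONDS);
            consumer.join(200);                  // give it a chance to exit (it cannot)
            boolean parked = consumer.isAlive();
            latch.releaseSafePoint();            // clean up so the thread can finish
            consumer.join(5000);
            return attained && parked && !consumer.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("handshake completed: " + runHandshake());
        System.out.println("consumer parked until released: "
            + consumerStaysParkedUntilReleased());
    }
}
```

In this model the consumer's progress depends entirely on the other side reaching its release call, which is why anything that stops the roller between creating the latch and finishing the handshake leaves the consumer stuck.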
The fix is hereabouts:
{code}
private void attainSafePoint(final long currentSequence) {
  if (this.zigzagLatch == null || !this.zigzagLatch.isCocked()) return;
  ...
{code}
... the check needs to be more than the existence of the zigzagLatch and that
it is cocked. Let me chat w/ [~eclark].
> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
> Key: HBASE-14317
> URL: https://issues.apache.org/jira/browse/HBASE-14317
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.2.0, 1.1.1
> Reporter: stack
> Priority: Critical
> Attachments: 14317.test.txt, HBASE-14317-v1.patch,
> HBASE-14317-v2.patch, HBASE-14317-v3.patch, HBASE-14317-v4.patch,
> HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN -
> Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, san_dump.txt,
> subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960) but we get stuck.
> See the attached thread dump and associated log. What is interesting is that
> the syncers are waiting to take syncs to run and at the same time we want to
> flush, so we are waiting on a safe point, but there seems to be nothing in our
> ring buffer; did we go to roll the log and not add the safe point sync to
> clear out the ring buffer?
> Needs a bit of study. Try to reproduce.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)