[jira] [Updated] (HBASE-14317) Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL

stack (JIRA) Mon, 31 Aug 2015 22:52:04 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack updated HBASE-14317:
--------------------------
    Attachment: repro.txt

Repro of the hang seen in the original attachment raw.php. We cannot replace 
the log because we are waiting on the zigzag latch. We cannot close the region 
because we are waiting on a flush. Flushes cannot progress because they are 
waiting on their sequenceid. The test, testLockedUpWALSystem, times out after 
60 seconds of hang. The test is ugly because it has to stand up a region and a 
log roller... mostly boilerplate. Also reverts HBASE-13971, which only 
confuses. Adds a method to FSHLog so I can hold processing around zigzag latch 
creation.

The hang happens if a sync comes off the ring buffer AFTER we've created a 
SafePointZigZagLatch -- the very existence of this object means the ring buffer 
consuming thread will fall into the attain-safe-point code block (even if the 
interrupting sync just throws an exception) -- but BEFORE we have published the 
replaceWriter zigzag sync onto the ring buffer: i.e. if the sync comes in AFTER 
line #794 in the below but BEFORE #806.

{code}
 784   Path replaceWriter(final Path oldPath, final Path newPath, Writer nextWriter,
 785       final FSDataOutputStream nextHdfsOut)
 786   throws IOException {
 787     // Ask the ring buffer writer to pause at a safe point.  Once we do this, the writer
 788     // thread will eventually pause. An error hereafter needs to release the writer thread
 789     // regardless -- hence the finally block below.  Note, this method is called from the FSHLog
 790     // constructor BEFORE the ring buffer is set running so it is null on first time through
 791     // here; allow for that.
 792     SyncFuture syncFuture = null;
 793     SafePointZigZagLatch zigzagLatch = (this.ringBufferEventHandler == null)?
 794       null: this.ringBufferEventHandler.attainSafePoint();
 795     afterZigZagLatch();
 796     TraceScope scope = Trace.startSpan("FSHFile.replaceWriter");
 797     try {
 798       // Wait on the safe point to be achieved.  Send in a sync in case nothing has hit the
 799       // ring buffer between the above notification of writer that we want it to go to
 800       // 'safe point' and then here where we are waiting on it to attain safe point.  Use
 801       // 'sendSync' instead of 'sync' because we do not want this thread to block waiting on it
 802       // to come back.  Cleanup this syncFuture down below after we are ready to run again.
 803       try {
 804         if (zigzagLatch != null) {
 805           Trace.addTimelineAnnotation("awaiting safepoint");
 806           syncFuture = zigzagLatch.waitSafePoint(publishSyncOnRingBuffer());
...{code}
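
To make the window concrete, here is a rough, single-threaded toy model of the ordering only. It is not FSHLog code and it does not reproduce the hang itself: ToyLatch, consumeOne, run and the event names are all made up for this sketch, and the real consumer is a separate thread pulling from the ring buffer. It just shows that a check based only on the latch existing and being cocked sends the consumer into the safe point path on whatever event arrives first after line 794, whether or not that is the roller's own sync from line 806.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model only; none of these names exist in HBase. */
class SafePointWindowSketch {
    /** Stand-in for SafePointZigZagLatch: tracks nothing but "cocked". */
    static final class ToyLatch {
        boolean cocked = true;
    }

    private final Queue<String> ringBuffer = new ArrayDeque<>();
    private final List<String> trace = new ArrayList<>();
    private ToyLatch zigzagLatch; // null until a roll asks for a safe point

    /** Consumer side: take one event, then apply the existence-and-cocked check. */
    private void consumeOne() {
        String event = ringBuffer.poll();
        if (event == null) {
            return;
        }
        trace.add("consumed " + event);
        if (zigzagLatch == null || !zigzagLatch.cocked) {
            return;
        }
        // The check passed, so the consumer heads for the safe point on this event.
        trace.add("safe point entered on " + event);
        zigzagLatch.cocked = false;
    }

    /** Replays the roller's steps, optionally letting a sync slip into the window. */
    static List<String> run(boolean syncArrivesInWindow) {
        SafePointWindowSketch s = new SafePointWindowSketch();
        s.zigzagLatch = new ToyLatch();      // roller: the step at line 794
        if (syncArrivesInWindow) {
            s.ringBuffer.add("client-sync"); // lands inside the window
            s.consumeOne();
        }
        s.ringBuffer.add("roller-sync");     // roller: the publish at line 806
        s.consumeOne();
        return s.trace;
    }

    public static void main(String[] args) {
        System.out.println("sync in window:    " + run(true));
        System.out.println("no sync in window: " + run(false));
    }
}
```

With a sync in the window, the trace shows the safe point path being taken on client-sync before roller-sync has even been published; without one, it is taken on roller-sync as replaceWriter intends.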

The fix is hereabouts:

{code}
    private void attainSafePoint(final long currentSequence) {
      if (this.zigzagLatch == null || !this.zigzagLatch.isCocked()) return;
...
{code}

... the check needs to be more than the existence of zigzagLatch and that it 
is cocked... Let me chat w/ [~eclark]





> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Priority: Critical
>         Attachments: 14317.test.txt, HBASE-14317-v1.patch, 
> HBASE-14317-v2.patch, HBASE-14317-v3.patch, HBASE-14317-v4.patch, 
> HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN - 
> Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, san_dump.txt, 
> subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960) but we get stuck. 
> See the attached thread dump and associated log. What is interesting is that 
> syncers are waiting to take syncs to run and at the same time we want to flush, 
> so we are waiting on a safe point, but there seems to be nothing in our ring 
> buffer; did we go to roll the log and not add the safe point sync to clear out 
> the ring buffer?
> Needs a bit of study. Try to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
