[ 
https://issues.apache.org/jira/browse/HBASE-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576576#comment-15576576
 ] 

Enis Soztutar commented on HBASE-16824:
---------------------------------------

The main problem is this: 
 - We use the SafePointZigZagLatch to coordinate the safe point between the log 
roller thread and the RingBufferEventHandler (RBEH) thread. 
 - LogRoller starts the safe point process by signaling the RBEH to start 
attaining the safe point. 
 - RBEH sees this request and waits until the sync point is past the sequence 
of the last item in the batch. By this time, everything should already be 
appended and waiting for the sync. 
 - RBEH waits for the highest synced sequence id to be greater than or equal to 
the waiting sequence id, which makes sure that writer.sync() has completed and 
the data is safe. This is the loop:
{code}
        while ((!this.shutdown && this.zigzagLatch.isCocked()
            && highestSyncedTxid.get() < currentSequence &&
            // We could be in here and all syncs are failing or failed. Check for this. Otherwise
            // we'll just be stuck here for ever. In other words, ensure there syncs running.
            isOutstandingSyncs())
{code}
 - However, even though {{highestSyncedTxid.get() >= currentSequence}} holds at 
this point, some other SyncRunner thread may still be trying to sync entries 
whose sequence ids are less than highestSyncedTxid. We have an optimization to 
return early without calling {{writer.sync()}}, but we cannot rely on it, 
because a thread can be descheduled between the check and the 
{{writer.sync()}} call. 
 - This results in a case where we have already closed and replaced the writer, 
but a LogSyncer thread calls {{writer.sync()}} on an already closed stream. All 
the SyncFutures will then get exceptions rather than the success result (they 
should succeed because a higher txid is already synced). 
 - The fix is conceptually simple: in attainSafePoint, we also have to wait for 
all the SyncRunner threads to finish their work. 
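
To make the last point concrete, here is a minimal sketch of the idea, not the 
actual patch. The class InFlightSyncTracker and its methods are hypothetical 
names that do not exist in HBase. It assumes each sync thread brackets its 
writer.sync() call with beginSync()/endSync(), and that no new syncs are handed 
out while the roller waits (in FSHLog that would be because the RBEH is parked 
at the safe point).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper (not part of HBase): counts writer.sync() calls in flight. */
final class InFlightSyncTracker {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition idle = lock.newCondition();
  private int inFlight = 0;

  /** A sync thread calls this just before it touches the writer. */
  void beginSync() {
    lock.lock();
    try {
      inFlight++;
    } finally {
      lock.unlock();
    }
  }

  /** A sync thread calls this once writer.sync() has returned or thrown. */
  void endSync() {
    lock.lock();
    try {
      if (--inFlight == 0) {
        idle.signalAll(); // wake a roller waiting in awaitIdle()
      }
    } finally {
      lock.unlock();
    }
  }

  /**
   * The roller calls this before closing the old writer.
   * Returns true once no sync is in flight, false if the timeout elapses first.
   */
  boolean awaitIdle(long timeout, TimeUnit unit) throws InterruptedException {
    long nanos = unit.toNanos(timeout);
    lock.lock();
    try {
      while (inFlight > 0) {
        if (nanos <= 0L) {
          return false;
        }
        nanos = idle.awaitNanos(nanos); // returns the remaining wait time
      }
      return true;
    } finally {
      lock.unlock();
    }
  }
}
```

With something like this, the safe point would only count as attained once 
awaitIdle() returns true, so the old writer is never closed underneath a sync 
thread that has passed the early-return check but not yet called 
writer.sync(). 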

> Make replacement of path the first operation during WAL rotation
> ----------------------------------------------------------------
>
>                 Key: HBASE-16824
>                 URL: https://issues.apache.org/jira/browse/HBASE-16824
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Atri Sharma
>
> In https://issues.apache.org/jira/browse/HBASE-12074, we hit an error if an 
> async thread calls flush on a WAL record already closed as the WAL is being 
> rotated. This JIRA investigates if setting the new WAL record path as the 
> first operation during WAL rotation will fix the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
