[
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678382#comment-13678382
]
Enis Soztutar commented on HBASE-7006:
--------------------------------------
Here is a proposed scheme that can solve this problem:
The region will be opened for replaying as before. Normal writes go to the
memstore, and the memstore is flushed as usual. The region servers that read
the WAL and send edits to the replaying RS stay the same, except that the
edits are now sent together with their seq_ids.
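To make the seq_id piggybacking concrete, here is a minimal Java sketch of
what a replayed edit could carry over the replay RPC. All names here
(ReplayedEdit, KEY_THEN_SEQ, ...) are illustrative, not from any patch, and a
plain byte[] key stands in for a real KeyValue:
{code:java}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

// Illustrative record for a replayed edit (names are mine, not from a patch):
// the essential protocol change is that each cell crosses the replay RPC
// together with the seq_id it had in the source WAL.
final class ReplayedEdit {
    final byte[] key;    // serialized cell key (stand-in for an HBase KeyValue)
    final byte[] value;
    final long seqId;    // sequence id from the original WAL entry

    ReplayedEdit(byte[] key, byte[] value, long seqId) {
        this.key = key;
        this.value = value;
        this.seqId = seqId;
    }

    // Order by key first, then seq_id: the <kv,seq> order used below.
    // Arrays.compareUnsigned is lexicographic unsigned comparison (JDK 9+).
    static final Comparator<ReplayedEdit> KEY_THEN_SEQ =
        Comparator.comparing((ReplayedEdit e) -> e.key, Arrays::compareUnsigned)
                  .thenComparingLong(e -> e.seqId);

    long heapSize() {                 // rough memstore-style accounting
        return key.length + value.length + Long.BYTES;
    }

    void writeTo(DataOutput out) throws IOException {
        out.writeInt(key.length);
        out.write(key);
        out.writeInt(value.length);
        out.write(value);
        out.writeLong(seqId);
    }

    static ReplayedEdit readFrom(DataInput in) throws IOException {
        byte[] k = new byte[in.readInt()];
        in.readFully(k);
        byte[] v = new byte[in.readInt()];
        in.readFully(v);
        return new ReplayedEdit(k, v, in.readLong());
    }
}
{code}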
On the replaying RS, there is a single buffer for all regions that are in
replaying state. All edits are appended to this buffer without any sorting.
The buffer is accounted like a memstore, with the memstore flush size as its
maximum. Once that size is reached, or we are asked to flush due to global
memstore pressure, we sort the buffer and spill it to disk. The buffer holds
<kv,seq> pairs and sorts by <kv,seq>. If there is no memory pressure and the
buffer does not fill up, we never need to spill to disk.
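A rough sketch of such a buffer, reusing the ReplayedEdit type above (again,
the naming is mine):
{code:java}
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-RS replay buffer (illustrative names): appends are
// unsorted and O(1); when the buffer reaches the memstore flush size, or
// global memstore accounting asks us to flush, we sort by <kv,seq> and
// spill one sorted run to disk.
final class ReplayBuffer {
    private final List<ReplayedEdit> edits = new ArrayList<>();
    private final List<Path> spilledRuns = new ArrayList<>();
    private final long maxHeapSize;   // e.g. the memstore flush size
    private long heapSize = 0;

    ReplayBuffer(long maxHeapSize) {
        this.maxHeapSize = maxHeapSize;
    }

    // No sorting on the write path; just append and account.
    void append(ReplayedEdit e) throws IOException {
        edits.add(e);
        heapSize += e.heapSize();
        if (heapSize >= maxHeapSize) {
            spill();   // also callable under global memstore pressure
        }
    }

    // Sort once, write one sorted run, reset the in-memory buffer.
    void spill() throws IOException {
        if (edits.isEmpty()) {
            return;
        }
        edits.sort(ReplayedEdit.KEY_THEN_SEQ);
        Path run = Files.createTempFile("replay-run-", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(run)))) {
            for (ReplayedEdit e : edits) {
                e.writeTo(out);
            }
        }
        spilledRuns.add(run);
        edits.clear();
        heapSize = 0;
    }
}
{code}
Note the write path does no comparisons at all; the sort cost is paid once
per spill instead of once per insert.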
Once the replaying is finished and the master asks the region server to open
the region for reading, we do a final merge sort over the in-memory sorted
buffer and all on-disk spilled buffers, creating an hfile and discarding kv's
that have the same key but a smaller seq_id. The result is a single hfile
corresponding to a flush, with a seq_id obtained from the WAL edits. We then
add this hfile to the store and open the region as usual. Keeping an unsorted
buffer and sorting it with qsort, with spills and a final on-disk merge sort,
might even be faster than the alternative, since inserting into the memstore
amounts to an insertion sort.
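The final merge is then a standard k-way merge. Since every run is sorted by
<kv,seq>, all versions of a key are adjacent in merge order and the last one
seen carries the largest seq_id, so de-duplication is a one-element lookahead.
A sketch under the same assumptions as above:
{code:java}
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Sketch of the final merge (illustrative names): k-way merge of the sorted
// in-memory buffer and the spilled runs. For each key, only the edit with
// the largest seq_id survives and is handed to the hfile writer.
final class ReplayMerge {
    // One cursor per sorted run (in-memory buffer or spilled file).
    private static final class Cursor {
        final Iterator<ReplayedEdit> it;
        ReplayedEdit head;
        Cursor(Iterator<ReplayedEdit> it) { this.it = it; advance(); }
        void advance() { head = it.hasNext() ? it.next() : null; }
    }

    static void merge(List<Iterator<ReplayedEdit>> runs,
                      Consumer<ReplayedEdit> hfileWriter) {
        PriorityQueue<Cursor> pq = new PriorityQueue<>(
            Comparator.comparing((Cursor c) -> c.head,
                                 ReplayedEdit.KEY_THEN_SEQ));
        for (Iterator<ReplayedEdit> run : runs) {
            Cursor c = new Cursor(run);
            if (c.head != null) {
                pq.add(c);
            }
        }
        ReplayedEdit pending = null;  // newest edit seen so far for its key
        while (!pq.isEmpty()) {
            Cursor c = pq.poll();
            ReplayedEdit e = c.head;
            c.advance();
            if (c.head != null) {
                pq.add(c);
            }
            if (pending != null
                    && Arrays.compareUnsigned(pending.key, e.key) != 0) {
                hfileWriter.accept(pending);  // key changed: emit survivor
            }
            pending = e;  // same key with >= seq_id replaces the older edit
        }
        if (pending != null) {
            hfileWriter.accept(pending);      // flush the last survivor
        }
    }
}
{code}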
The other change we need is that replayed edits will not go into the WAL
again; instead we keep track of the recovering state for the region server
and redo the work if there is a subsequent failure.
In short, this is close to BigTable's approach of sorting each WAL file in
memory, except that we gather the edits for the region from all WAL files by
doing the replay RPC, and do the sort per region. The end result is a flushed
hfile, as if the region had just flushed before the crash.
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
> Issue Type: New Feature
> Components: MTTR
> Reporter: stack
> Assignee: Jeffrey Zhong
> Priority: Critical
> Fix For: 0.98.0, 0.95.1
>
> Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch,
> hbase-7006-combined.patch, hbase-7006-combined-v1.patch,
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch,
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch, LogSplitting
> Comparison.pdf,
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes
> had 1700 WALs to replay. Replay took almost an hour. It looks like it could
> run faster; much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least. Can always punt.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira