[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676458#comment-13676458
 ] 

Jeffrey Zhong commented on HBASE-7006:
--------------------------------------

I thought about this issue the whole morning and also discussed it with other folks.
Basically the root issue is maintaining the receiving order during recovery for
puts with the exact same key + version (timestamp). Since the log recovery process
can work on multiple WAL files at the same time, the replay order isn't guaranteed
to match the receiving order. I'm listing several options below to see what
others think.
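
To make the ordering problem concrete, here is a minimal, self-contained sketch
(plain Java with a sorted map standing in for the memstore; it does not use real
HBase classes). With an identical key + version, whichever edit reaches the
memstore last wins, so replaying WALs out of receiving order can silently flip
the result:

{code:java}
import java.util.concurrent.ConcurrentSkipListMap;

public class SameVersionOrdering {
  public static void main(String[] args) {
    // Stand-in for a memstore: key is "row/cf:qualifier/timestamp". For puts with
    // the exact same key + version, a later insert simply overwrites an earlier one.
    ConcurrentSkipListMap<String, String> memstore = new ConcurrentSkipListMap<>();
    String key = "row1/cf:q/1000";       // same row, column and timestamp

    // Edits in the order the RS originally received them (WAL1 then WAL2).
    memstore.put(key, "v1");             // from WAL1
    memstore.put(key, "v2");             // from WAL2 -> client expects "v2" to win
    System.out.println("receiving order    : " + memstore.get(key)); // v2

    // If WAL2 happens to be replayed before WAL1, the older value wins instead.
    memstore.clear();
    memstore.put(key, "v2");             // from WAL2 replayed first
    memstore.put(key, "v1");             // from WAL1 replayed later
    System.out.println("out-of-order replay: " + memstore.get(key)); // v1
  }
}
{code}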

h5. Option one (the simplest)
Document this limitation in the release note, assuming that updates to the exact
same version are a rare usage pattern.

h5. Option two (still simple but hacky)
a) disallow writes during recovery
b) hold flushes until all WALs of a recovering region are replayed (a minimal
sketch of the flush gate follows this list). The memstore should be able to hold
the edits because we only recover unflushed WAL edits.
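
Here is roughly what the flush gate of option two could look like. This is a
hypothetical sketch; the names (RecoveringRegionFlushGate, markRecoveryDone,
requestFlush) are made up for illustration and are not existing HBase APIs:

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

class RecoveringRegionFlushGate {
  private final AtomicBoolean recovering = new AtomicBoolean(true);

  /** Called once every WAL of the recovering region has been replayed. */
  void markRecoveryDone() {
    recovering.set(false);
  }

  /** Flush requests are refused (or re-queued) while recovery is in progress. */
  boolean requestFlush() {
    if (recovering.get()) {
      // The memstore keeps growing during recovery; this should be acceptable
      // because only unflushed WAL edits are being recovered.
      return false;
    }
    // ... proceed with the normal flush path ...
    return true;
  }
}
{code}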

h5. Option three (multiple memstores)
a) Let the splitlogworker pick WALs of a failed RS in order instead of at random.
Say a failed RS has WAL1, WAL2, WAL3, ... WALk; a worker will only pick WAL2 once
WAL1 is done (or errored), etc.
b) During replay, we pass the original WAL sequence ids of edits to the receiving RS.
c) In the receiving RS, we bucket WAL files into separate memstores during replay
and use the original sequence ids. Say wal1-wal4 go to memstore1, wal5-wal10 to
memstore2, etc. We only flush a bucket's memstore when all WALs inside the bucket
are replayed; all WALs can still be replayed concurrently (see the bookkeeping
sketch after this list).
d) writes from normal traffic (writes are allowed during recovery) go into a
different memstore, as today, and flush normally with new sequence ids.
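
The per-bucket bookkeeping of option three could look roughly like the sketch
below. Class and method names (WalBucketTracker, walReplayed, flushBucketMemstore)
are made up for illustration; the point is only that a bucket's memstore is
flushed once every WAL assigned to that bucket has finished replaying:

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WalBucketTracker {
  // bucket id -> WAL names still being replayed into that bucket's memstore
  private final Map<Integer, Set<String>> pendingWals = new HashMap<>();

  WalBucketTracker(List<List<String>> buckets) {
    for (int i = 0; i < buckets.size(); i++) {
      pendingWals.put(i, new HashSet<>(buckets.get(i)));
    }
  }

  /** Called when one WAL finishes (or errors out); flushes the bucket once complete. */
  synchronized void walReplayed(int bucket, String walName) {
    Set<String> remaining = pendingWals.get(bucket);
    remaining.remove(walName);
    if (remaining.isEmpty()) {
      flushBucketMemstore(bucket); // safe: every edit of this bucket is now present
    }
  }

  private void flushBucketMemstore(int bucket) {
    System.out.println("flushing memstore for bucket " + bucket);
  }
}
{code}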

h5. Option four
a) During replay, we pass the original WAL sequence ids.
b) For each WAL edit, we store the edit's original sequence id along with its key.
c) During scanning, we use the original sequence id if it's present, otherwise the
store file's sequence id.
d) Compaction can just keep the put with the max sequence id (a comparator sketch
follows this list).
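
A sketch of how the extra sequence id of option four could drive ordering during
scans and compaction. The Edit class, the origSeqId field and the comparator are
illustrative only, not the real KeyValue/CellComparator code:

{code:java}
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class Edit {
  final String key;      // row/cf:qualifier
  final long timestamp;  // version supplied by the client
  final long origSeqId;  // original WAL sequence id carried through replay
  final String value;

  Edit(String key, long timestamp, long origSeqId, String value) {
    this.key = key;
    this.timestamp = timestamp;
    this.origSeqId = origSeqId;
    this.value = value;
  }
}

class EditOrdering {
  // Among edits with the same key + timestamp, the one with the highest original
  // sequence id is the newest, so scans return it first...
  static final Comparator<Edit> NEWEST_FIRST =
      Comparator.comparing((Edit e) -> e.key)
                .thenComparing(Comparator.comparingLong((Edit e) -> e.timestamp).reversed())
                .thenComparing(Comparator.comparingLong((Edit e) -> e.origSeqId).reversed());

  // ...and compaction only keeps this one among duplicates of the same key + version.
  static Edit keptByCompaction(List<Edit> sameKeyAndVersion) {
    return Collections.max(sameKeyAndVersion,
        Comparator.comparingLong((Edit e) -> e.origSeqId));
  }
}
{code}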





     
                
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
>                 Key: HBASE-7006
>                 URL: https://issues.apache.org/jira/browse/HBASE-7006
>             Project: HBase
>          Issue Type: New Feature
>          Components: MTTR
>            Reporter: stack
>            Assignee: Jeffrey Zhong
>            Priority: Critical
>             Fix For: 0.98.0, 0.95.1
>
>         Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch, 
> hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch, 
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch, 
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster; much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.
