[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676549#comment-13676549 ]

Jeffrey Zhong commented on HBASE-7006:
--------------------------------------

{quote}
On option two, if WALs are being replayed without order, couldn't an edit from 
WAL 1 (an old WAL) overwrite an edit from WAL 3 (a newer WAL) because memstore 
does not consider sequenceid?
{quote}
You're right. Option 2 would have to consider the original WAL sequenceId as well.
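
To illustrate the point (hypothetical code, not the actual memstore implementation): each replayed edit would carry its origin sequenceId, and an edit is applied only if it is newer than what the receiving memstore already holds, so a late-arriving edit from an older WAL cannot overwrite one from a newer WAL.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class ReplayOrderSketch {

    /** An edit as it arrives from a replaying RS, tagged with its origin sequenceId. */
    static final class ReplayedEdit {
        final byte[] value;
        final long originSequenceId;   // sequenceId assigned when the edit was first written

        ReplayedEdit(byte[] value, long originSequenceId) {
            this.value = value;
            this.originSequenceId = originSequenceId;
        }
    }

    // Simplified "memstore": row key -> newest edit seen so far.
    private final Map<String, ReplayedEdit> memstore = new HashMap<>();

    /** Apply a replayed edit only if it is newer than what the memstore already holds. */
    void applyReplayedEdit(String row, ReplayedEdit edit) {
        ReplayedEdit current = memstore.get(row);
        if (current == null || edit.originSequenceId > current.originSequenceId) {
            memstore.put(row, edit);
        }
        // Otherwise drop it: it came from an older WAL and is already superseded.
    }

    public static void main(String[] args) {
        ReplayOrderSketch sketch = new ReplayOrderSketch();
        // Edit from WAL 3 (newer) arrives first, then an edit from WAL 1 (older).
        sketch.applyReplayedEdit("row1", new ReplayedEdit("v3".getBytes(), 300L));
        sketch.applyReplayedEdit("row1", new ReplayedEdit("v1".getBytes(), 100L));
        // The memstore keeps the sequenceId-300 value; the stale edit is ignored.
        System.out.println(new String(sketch.memstore.get("row1").value)); // prints "v3"
    }
}
{code}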

{quote}
In Option three, how will you bucket WALs? You will need to pass in the WAL 
file name when you do the Put? How will you signal the regionserver the WAL is 
done? A special edit?
{quote}
I need to pass the WAL file name (or its hash) inside each replay batch. The 
receiving RS can put a watcher on the split-log deletion/done ZK events and 
flush a bucket's memstore once all log files of that bucket are recovered. 
Bucket logic is controlled by the receiving RS and configuration. Since all the 
info is in ZK, the receiving RS can determine which files belong to which bucket.

{quote}
Lets say 2k regions on two servers? That means one server will need to take all 
edits from 1k regions? Lets say there were 256k WALs? At 128M per WAL that is 
32G of edits we'd have to keep in memory w/o flushing? If were also taking 
writes for all 2k regions, that would be extra memory pressure. We'd fall over 
in this case.
{quote}
I guess 2k regions on two servers is a long-term goal for us. We could add a 
memory limit across all memstores opened for replay. If the limit is exceeded, 
the receiving RS rejects replays. In addition, it could pick a memstore, resign 
the work items for that store, and tell the replaying RS to reassign the work 
item (WAL file).
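
For example, something along these lines (illustrative only; names and numbers are not from the code base):
{code:java}
import java.util.concurrent.atomic.AtomicLong;

public class ReplayMemoryLimiter {

    private final long maxReplayHeapBytes;             // e.g. a fraction of the global memstore limit
    private final AtomicLong usedBytes = new AtomicLong();

    public ReplayMemoryLimiter(long maxReplayHeapBytes) {
        this.maxReplayHeapBytes = maxReplayHeapBytes;
    }

    /**
     * Try to reserve room for a replay batch. Returning false means the receiving RS
     * rejects the batch, so the replaying RS can back off or the work item (WAL file)
     * can be reassigned to another server.
     */
    boolean tryReserve(long batchBytes) {
        long newUsed = usedBytes.addAndGet(batchBytes);
        if (newUsed > maxReplayHeapBytes) {
            usedBytes.addAndGet(-batchBytes);           // roll back the reservation
            return false;
        }
        return true;
    }

    /** Release memory after the corresponding replay memstore has been flushed. */
    void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }
}
{code}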

I like your single-WAL memstore flush approach (a special case where the number 
of WALs per bucket = 1). It keeps memory management and flushing simpler at the 
cost of more IOs. We could implement this at the beginning and possibly group 
more log files per flush depending on how the multiple-WAL implementation goes.
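
Reusing the hypothetical tracker sketched above, the single-WAL approach is just the degenerate case of one WAL per bucket, so each memstore flushes as soon as its WAL is recovered:
{code:java}
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SingleWalFlushExample {
    public static void main(String[] args) {
        // WAL files to replay, as they would be listed under the split-log znode (dummy names).
        List<String> wals = Arrays.asList("rs1.wal.1", "rs1.wal.2", "rs2.wal.1");

        // One bucket per WAL.
        ReplayBucketTracker tracker = new ReplayBucketTracker();
        for (int i = 0; i < wals.size(); i++) {
            tracker.registerBucket(i, Collections.singleton(wals.get(i)));
        }

        // Simulate the ZK "split log done" events; each one triggers an immediate flush.
        for (String wal : wals) {
            tracker.walRecovered(wal);
        }
    }
}
{code}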

Thanks! 

> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
>                 Key: HBASE-7006
>                 URL: https://issues.apache.org/jira/browse/HBASE-7006
>             Project: HBase
>          Issue Type: New Feature
>          Components: MTTR
>            Reporter: stack
>            Assignee: Jeffrey Zhong
>            Priority: Critical
>             Fix For: 0.98.0, 0.95.1
>
>         Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch, 
> hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch, 
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch, 
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had 
> 1700 WALs to replay. Replay took almost an hour. It looks like it could run 
> faster; much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.
