[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676513#comment-13676513 ]

stack commented on HBASE-7006:
------------------------------

On option two, if WALs are being replayed without order, couldn't an edit from 
WAL 1 (an old WAL) overwrite an edit from WAL 3 (a newer WAL) because memstore 
does not consider sequenceid?

I do not think option three will work.  We want to be able to put multiple WALs 
per server in place in the near future, and in that case the sequenceids will be 
spread across a few logs (probably two is enough).  Since the sequenceids will 
be spread across N WALs, splitlogworker will not be able to deduce WAL order 
because some WALs will be contemporaneous, having been written to in parallel 
(in other words, replay brings forward a problem we are going to need to solve 
anyway).
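
To make the ordering problem concrete, here is a tiny sketch (plain Java, 
hypothetical numbers, nothing HBase-specific): once a server appends to two 
WALs in parallel, the per-file sequenceid ranges overlap, so neither file is 
strictly "older" than the other.

{code:java}
// Sketch only: with N WALs written in parallel, per-file sequenceid ranges
// overlap, so a splitlogworker cannot order the files just by looking at them.
public class WalOrderExample {

  static class WalRange {
    final String wal;
    final long minSeqId;
    final long maxSeqId;
    WalRange(String wal, long minSeqId, long maxSeqId) {
      this.wal = wal; this.minSeqId = minSeqId; this.maxSeqId = maxSeqId;
    }
    boolean overlaps(WalRange other) {
      return this.minSeqId <= other.maxSeqId && other.minSeqId <= this.maxSeqId;
    }
  }

  public static void main(String[] args) {
    // Edits round-robin between two concurrently written WALs.
    WalRange wal1 = new WalRange("wal-1", 100, 205); // seqids 100, 102, 104, ...
    WalRange wal2 = new WalRange("wal-2", 101, 204); // seqids 101, 103, 105, ...
    // The ranges overlap, so neither file can be called the "older" one.
    System.out.println("ranges overlap: " + wal1.overlaps(wal2)); // true
  }
}
{code}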

In Option three, how will you bucket WALs?  Will you need to pass in the WAL 
file name when you do the Put?  How will you signal the regionserver that the 
WAL is done?  A special edit?
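
For what it's worth, a hedged sketch of the "pass the WAL along with the Put" 
idea (assumes a recent hbase-client): tag each replayed Put with the originating 
WAL and its sequenceid via the generic Mutation attribute mechanism.  The 
attribute names below are made up for illustration, and how completion (the 
"special edit") would be signalled is left open.

{code:java}
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch: carry the originating WAL (hash of its path) and the edit's
// sequenceid on each replayed Put as attributes. Attribute names are
// hypothetical, not part of HBASE-7006.
public class ReplayPutTagging {

  static final String WAL_HASH_ATTR = "replay.wal.hash";   // hypothetical name
  static final String SEQ_ID_ATTR   = "replay.wal.seqid";  // hypothetical name

  static Put tagForReplay(Put put, String walPathHash, long seqId) {
    put.setAttribute(WAL_HASH_ATTR, Bytes.toBytes(walPathHash));
    put.setAttribute(SEQ_ID_ATTR, Bytes.toBytes(seqId));
    return put;
  }

  public static void main(String[] args) {
    Put p = new Put(Bytes.toBytes("row-1"));
    p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
    tagForReplay(p, "<hash-of-wal-path>", 12345L);
  }
}
{code}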

On replay, do you need a memstore that considers sequenceid, such that when two 
edits have the same coordinates, the one w/ the latest sequenceid is retained 
rather than the last one written?
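
Something like the following minimal sketch (plain Java, not HBase's actual 
MemStore) shows the resolution rule being asked about: for edits at the same 
coordinates, the larger sequenceid wins no matter which WAL happened to be 
replayed first.  The same rule would also cover the option-two overwrite 
concern above.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a sequenceid-aware replay buffer (not HBase's MemStore):
// for two edits at the same row/family/qualifier/timestamp, the one with the
// larger WAL sequenceid wins, regardless of WAL replay order.
public class SeqIdAwareReplayBuffer {

  static class Edit {
    final long seqId;
    final byte[] value;
    Edit(long seqId, byte[] value) { this.seqId = seqId; this.value = value; }
  }

  // Key is the cell coordinate, e.g. "row/family/qualifier/timestamp".
  private final Map<String, Edit> buffer = new HashMap<String, Edit>();

  public void apply(String coordinate, long seqId, byte[] value) {
    Edit existing = buffer.get(coordinate);
    if (existing == null || seqId > existing.seqId) {
      buffer.put(coordinate, new Edit(seqId, value));
    }
    // else: an edit with a newer sequenceid is already there; drop this one.
  }

  public byte[] get(String coordinate) {
    Edit e = buffer.get(coordinate);
    return e == null ? null : e.value;
  }
}
{code}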

What is the worst case if we could not flush until all WALs replayed?

Let's say 2k regions on two servers?  That means one server will need to take 
all edits from 1k regions?  Let's say there were 256 WALs?  At 128M per WAL 
that is 32G of edits we'd have to keep in memory w/o flushing.  If we're also 
taking writes for all 2k regions, that would be extra memory pressure.  We'd 
fall over in this case.
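
Back-of-the-envelope check of those numbers:

{code:java}
// Quick arithmetic check of the worst case above: 256 WALs held entirely
// in memory at ~128 MB each because nothing can flush until replay finishes.
public class ReplayMemoryEstimate {
  public static void main(String[] args) {
    long wals = 256;
    long bytesPerWal = 128L * 1024 * 1024;            // 128 MB
    long totalBytes = wals * bytesPerWal;             // 32 GB
    System.out.println(totalBytes / (1024L * 1024 * 1024) + " GB"); // prints 32 GB
  }
}
{code}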

Could the replay tell the RS it was replaying a single WAL and when it was 
done?  For each WAL it could pass the sequence ids and a hash of the WAL path.  
Not sure how it would flag that the replay is done since, in distributed split, 
a RS could be taking on edits from multiple WALs at a time... (so we cannot 
treat the arrival of a new WAL file hash as meaning we are done w/ the old 
file).  The region server could take the edits into a special single-WAL 
memstore.  It could keep taking on edits from WALs and keep them in memory 
until it hit a memory barrier.  We could then flush these per-WAL memstores as 
hfiles w/ their sequence ids.  If a flush didn't get all of a WAL, that should 
be fine.  There would possibly be lots of hfiles, but having to flush would be 
rare I'd say (a RS w/ 1k regions and 256 WALs would be rare).
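
A rough sketch of that per-WAL buffering idea (names and the flush hook are 
illustrative, not an actual HBase implementation): keep one buffer per WAL 
being replayed, keyed by a hash of the WAL path, and when the overall memory 
limit is crossed flush each buffer as its own file carrying that WAL's max 
sequenceid.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Rough sketch only: one in-memory buffer per WAL being replayed, keyed by a
// hash of the WAL path; when the overall memory limit is crossed, flush every
// buffer as its own file tagged with that WAL's max sequenceid.
public class PerWalReplayBuffers {

  static class WalBuffer {
    long maxSeqId = -1;
    long bytes = 0;
    void add(long seqId, int editSize) {
      maxSeqId = Math.max(maxSeqId, seqId);
      bytes += editSize;
    }
  }

  private final Map<String, WalBuffer> buffers = new HashMap<String, WalBuffer>();
  private final long memoryLimitBytes;
  private long totalBytes = 0;

  PerWalReplayBuffers(long memoryLimitBytes) {
    this.memoryLimitBytes = memoryLimitBytes;
  }

  void onReplayEdit(String walPathHash, long seqId, int editSize) {
    WalBuffer buf = buffers.get(walPathHash);
    if (buf == null) {
      buf = new WalBuffer();
      buffers.put(walPathHash, buf);
    }
    buf.add(seqId, editSize);
    totalBytes += editSize;
    if (totalBytes > memoryLimitBytes) {
      flushAll();
    }
  }

  private void flushAll() {
    // Flushing a partial WAL is fine: each flushed file carries the max
    // sequenceid it contains, so ordering at read/compaction time still holds.
    for (Map.Entry<String, WalBuffer> e : buffers.entrySet()) {
      System.out.println("flush " + e.getKey() + " as hfile, maxSeqId="
          + e.getValue().maxSeqId + ", bytes=" + e.getValue().bytes);
    }
    buffers.clear();
    totalBytes = 0;
  }
}
{code}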


                
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
>                 Key: HBASE-7006
>                 URL: https://issues.apache.org/jira/browse/HBASE-7006
>             Project: HBase
>          Issue Type: New Feature
>          Components: MTTR
>            Reporter: stack
>            Assignee: Jeffrey Zhong
>            Priority: Critical
>             Fix For: 0.98.0, 0.95.1
>
>         Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch, 
> hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch, 
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch, 
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster; much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

