[
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677582#comment-13677582
]
stack commented on HBASE-7006:
------------------------------
bq. We have to use skip wal option here.
I was hoping to avoid our doing skip-WAL for the reasons argued above: replaying
edits w/ skip-WAL enabled introduces more states and will complicate replay. But
old edits coming into the new server and getting new seqids will itself make for
some new interesting states (if the server we are replaying into crashes before
all is flushed, it will have in its WALs edits where the seqid for 'B' is > that
for 'C', so on its recovery, 'B' will come out when we want 'C', the last edit
inserted at that coordinate).
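The shadowing hazard above can be sketched in a few lines. This is a toy model
with made-up names, not HBase classes; it only mimics the rule that, at one
coordinate, the cell with the highest seqid wins a read:

```java
import java.util.*;

// Minimal sketch (hypothetical names, not HBase classes) of why replaying
// old edits under fresh seqids can shadow newer data: at a given coordinate,
// the cell with the highest seqid wins the read.
public class SeqIdShadowing {

    static final class Cell {
        final String value;
        final long seqId;
        Cell(String value, long seqId) { this.value = value; this.seqId = seqId; }
    }

    // Resolve a read the way a memstore would: highest seqid wins.
    static String read(List<Cell> cells) {
        Cell best = null;
        for (Cell c : cells) {
            if (best == null || c.seqId > best.seqId) best = c;
        }
        return best == null ? null : best.value;
    }

    public static void main(String[] args) {
        // Original history: 'B' written at seqid 5, then 'C' at seqid 9.
        System.out.println(read(List.of(new Cell("B", 5), new Cell("C", 9)))); // C

        // Replay with fresh seqids: the old 'B' edit is re-inserted at the
        // receiving server's current seqid (say 12). If that server crashes
        // and recovers from its own WAL, 'B' now shadows the newer 'C'.
        System.out.println(read(List.of(new Cell("C", 9), new Cell("B", 12)))); // B
    }
}
```

Keeping the original seqids on replayed edits would avoid the inversion, which
is what the flush discussion below turns on.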
So, if no WAL, what happens when we need to flush a memstore or a background
replay memstore (the one-memstore-per-region we discuss above)? What seqid
would we write out into the hfile if we have to flush memory? I suppose if this
replay backing memstore kept the old WAL seqids, it would be legit to use
those. The flushed file would sort properly with an old seqid (but then this
would be a different kind of flush, one where you dictate the file's seqid
rather than take what is current in the server -- that will be intrusive to
change). We'd have to use the old ids in case we had to flush midway through a
WAL (I suppose we said this already above).
But thinking more on the per-WAL replay memstore, there are kinks to figure out
(apart from the one above where we want to do a flush w/ a seqid that is not
the server's current max seqid). Hfiles contain sorted kvs, but the edits in
the old WAL are not in sort order, so if we sort the edits in order to flush
the hfile, the seqids inside will be out of order. Do we take the highest seqid
in the hfile as the hfile's seqid? This would be different from how we usually
write hfiles. There could be issues in here.
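A small sketch of that flush, again with made-up names rather than HBase
classes: buffered WAL edits arrive in commit order, the "hfile" needs key
order, and the file gets stamped with the max seqid of what it contains --
leaving the per-edit seqids out of order inside the file:

```java
import java.util.*;

// Minimal sketch (hypothetical names, not HBase classes) of flushing a
// per-WAL replay memstore: edits are buffered in WAL commit order, the hfile
// needs key order, and the file's seqid is the max seqid of its edits.
public class ReplayMemstoreFlush {

    static final class Edit {
        final String key;
        final long seqId;
        Edit(String key, long seqId) { this.key = key; this.seqId = seqId; }
    }

    static final class HFileSketch {
        final List<Edit> sortedEdits; // key-sorted; seqids inside may be out of order
        final long maxSeqId;          // the seqid stamped on the whole file
        HFileSketch(List<Edit> sortedEdits, long maxSeqId) {
            this.sortedEdits = sortedEdits;
            this.maxSeqId = maxSeqId;
        }
    }

    // "Flush": sort the buffered WAL edits by key and stamp the file with
    // the highest seqid seen among them.
    static HFileSketch flush(List<Edit> walOrder) {
        List<Edit> sorted = new ArrayList<>(walOrder);
        sorted.sort(Comparator.comparing(e -> e.key));
        long max = -1;
        for (Edit e : walOrder) max = Math.max(max, e.seqId);
        return new HFileSketch(sorted, max);
    }

    public static void main(String[] args) {
        // Edits in the order they appear in the old WAL (commit order).
        List<Edit> wal = List.of(new Edit("row-c", 3), new Edit("row-a", 4), new Edit("row-b", 5));
        HFileSketch f = flush(wal);
        for (Edit e : f.sortedEdits) System.out.println(e.key + " seq=" + e.seqId);
        System.out.println("hfile seqid=" + f.maxSeqId);
        // row-a carries seq=4, row-b seq=5, row-c seq=3 -- not in seqid order,
        // while the file as a whole sorts against other hfiles at seqid 5.
    }
}
```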
bq. Another question is, initially we had one recovered.edits file per WAL; now
we planning one HFile per WAL.
This would happen only if we had to flush. We'd keep a per-WAL replay memstore;
only if we had to flush would the file be written out -- that would be an
extreme case.
bq. I'm planning to use a config to control the new behavior because the issue
we're trying to address isn't a common usage scenario.
bq. I'd vote we instead have a config that would disallow writes during
recovery
+1 on disabling writes during recovery for now. It is this that is adding the
complication. If we disable writes during recovery, we can turn on distributed
log replay now as the default and enjoy the speedup it brings over the current
log splitting. We can work on taking writes during recovery later, over in the
new issue.
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
> Issue Type: New Feature
> Components: MTTR
> Reporter: stack
> Assignee: Jeffrey Zhong
> Priority: Critical
> Fix For: 0.98.0, 0.95.1
>
> Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch,
> hbase-7006-combined.patch, hbase-7006-combined-v1.patch,
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch,
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch, LogSplitting
> Comparison.pdf,
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay. Replay took almost an hour. It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least. Can always punt.