[
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677402#comment-13677402
]
Himanshu Vashishtha commented on HBASE-7006:
--------------------------------------------
Thanks for the above discussion. I have some follow-up questions on Options 2/3:
1. If I am reading it correctly, Options 2/3 preserve the sequenceId of the
old WAL file? Does that mean the WAL edit created at the new RS for this entry
would carry the old sequenceId? Or something else?
When a region is opened, it reads the max sequenceIds from its StoreFiles and
sets the FSHLog counter to that value (if the counter is at some lower value).
If we are keeping the original sequenceIds, a WAL file could have a random
distribution of sequenceIds (it would not be tightly ascending as we have it
now). Could there be any gotcha here, such as handling chained fail-over?
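To make the "sets the counter to it (if the counter is at some lower value)" behavior concrete, here is a minimal sketch of a monotonic counter that only advances forward. This is illustrative only, not actual HBase code; the class and method names are hypothetical, though FSHLog's real counter follows a similar advance-only-if-lower pattern.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch (not the real FSHLog): a sequenceId counter
// that can be pushed forward on region open, but never backwards.
public class SequenceIdCounter {
    private final AtomicLong highestSequenceId = new AtomicLong(0);

    // On region open: advance to the max sequenceId read from the
    // StoreFiles, but only if the counter is currently lower.
    public void advanceTo(long maxStoreFileSeqId) {
        long current;
        while ((current = highestSequenceId.get()) < maxStoreFileSeqId) {
            if (highestSequenceId.compareAndSet(current, maxStoreFileSeqId)) {
                break;
            }
        }
    }

    // SequenceId handed to the next WAL edit.
    public long next() {
        return highestSequenceId.incrementAndGet();
    }
}
```

Under this model, replaying edits that keep their *original* sequenceIds would bypass `next()`, which is why a single WAL file could end up with non-ascending ids.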
2. Another question: initially we had one recovered.edits file per WAL; now we
are planning one HFile per WAL.
Looking at this, the saving in I/O (and NN ops) is not that large IMHO, since
it is the same number of files either way? A larger number of small files
could also lead to more compaction. Stripe compaction could help, but that's
a different thing (and I haven't looked at the compaction code). Bucketing
WALs is definitely better.
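The file-count argument above can be put as back-of-envelope arithmetic. This is a hedged sketch of my reasoning, not HBase code; the method names are made up for illustration.

```java
// Illustrative arithmetic only (hypothetical names, not HBase APIs):
// comparing output-file counts for the recovery schemes discussed above.
public class RecoveryFileCount {
    // One recovered.edits file per WAL: output files == WAL count.
    static long filesWithRecoveredEdits(long numWals) {
        return numWals;
    }

    // One HFile per WAL: still one output file per input WAL,
    // hence no saving in file count (or NN ops) on its own.
    static long filesWithHFilePerWal(long numWals) {
        return numWals;
    }

    // Bucketing N WALs into k groups yields k output files instead of N,
    // which is where the real saving would come from.
    static long filesWithBucketing(long numWals, long numBuckets) {
        return Math.min(numWals, numBuckets);
    }
}
```

So the win comes from bucketing, not from the recovered.edits-to-HFile swap by itself.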
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
> Issue Type: New Feature
> Components: MTTR
> Reporter: stack
> Assignee: Jeffrey Zhong
> Priority: Critical
> Fix For: 0.98.0, 0.95.1
>
> Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch,
> hbase-7006-combined.patch, hbase-7006-combined-v1.patch,
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch,
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch, LogSplitting
> Comparison.pdf,
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay. Replay took almost an hour. It looks like it could run
> faster, in that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least. Can always punt.