[
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677449#comment-13677449
]
Jeffrey Zhong commented on HBASE-7006:
--------------------------------------
I created HBASE-8701 and linked it to this JIRA to address the concerns from
Elliot and Himanshu.
{quote}
When a region is opened, it reads the max sequence IDs from its StoreFiles and
sets the FSHLog counter to the maximum (if the counter is at some lower value). If we
are keeping the original sequence IDs, a WAL file could have a random
distribution of sequence IDs (it would not be tightly ascending as we have it now).
Could there be any gotchas here, such as handling chained fail-over?
{quote}
We have to use the skip-WAL option here.
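For illustration only, here is a minimal Java sketch of the counter bump the quote describes; the class and method names (SequenceIdBump, advanceTo) are made up for the example and are not the actual FSHLog API. On region open, the max sequence ID read from the StoreFiles advances the WAL's sequence-id counter only if the counter is currently lower.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch (not HBase's actual FSHLog API) of advancing the WAL's
 * sequence-id counter to the max sequence ID found in a region's
 * StoreFiles, but only when the counter is currently lower.
 */
public class SequenceIdBump {

  /** Stand-in for the WAL's internal sequence-id counter. */
  private final AtomicLong logSequenceId = new AtomicLong(0);

  /** Advance the counter to maxStoreFileSeqId if it is currently lower. */
  public void advanceTo(long maxStoreFileSeqId) {
    long current;
    while ((current = logSequenceId.get()) < maxStoreFileSeqId) {
      // Retry on contention; another region open may bump it concurrently.
      if (logSequenceId.compareAndSet(current, maxStoreFileSeqId)) {
        return;
      }
    }
  }

  public long current() {
    return logSequenceId.get();
  }

  public static void main(String[] args) {
    SequenceIdBump wal = new SequenceIdBump();
    wal.advanceTo(42);   // counter was lower, so it moves to 42
    wal.advanceTo(10);   // counter already higher, so no change
    System.out.println(wal.current()); // prints 42
  }
}
{code}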
{quote}
Another question is, initially we had one recovered.edits file per WAL; now we
are planning one HFile per WAL
{quote}
This is a good question. The benefits remain: no recovered.edits-related I/O,
and writes are allowed during recovery. We already create many HFiles today
because we flush after each recovered.edits replay. I'm planning to use a
config option to control the new behavior, because the issue we're trying to
address isn't a common usage scenario; later we can introduce bucketing to
optimize this part.
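As a rough sketch of gating the new behavior behind a config, assuming a hypothetical property name (the real key would be whatever HBASE-8701 defines):
{code:java}
import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of gating the one-HFile-per-WAL replay path behind a
 * configuration flag. The property name below is made up for the
 * example and is not an actual HBase key.
 */
public class ReplayConfigCheck {

  // Hypothetical property name for illustration only.
  private static final String REPLAY_TO_HFILE_KEY =
      "hbase.regionserver.recovery.replaytohfile";

  public static boolean useHFilePerWalReplay(Configuration conf) {
    // Default to the existing recovered.edits behavior unless explicitly enabled.
    return conf.getBoolean(REPLAY_TO_HFILE_KEY, false);
  }
}
{code}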
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
> Issue Type: New Feature
> Components: MTTR
> Reporter: stack
> Assignee: Jeffrey Zhong
> Priority: Critical
> Fix For: 0.98.0, 0.95.1
>
> Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch,
> hbase-7006-combined.patch, hbase-7006-combined-v1.patch,
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch,
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch, LogSplitting
> Comparison.pdf,
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay. Replay took almost an hour. It looks like it could run
> faster, given that much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least. Can always punt.