[jira] Updated: (HBASE-1008) [performance] The replay of logs on server crash takes way too long

stack (JIRA) Thu, 20 Nov 2008 14:55:26 -0800

     [ 
https://issues.apache.org/jira/browse/HBASE-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack updated HBASE-1008:
-------------------------

    Fix Version/s:     (was: 0.19.0)
                   0.20.0

Looking at this, logging needs to be rethought. In the Bigtable paper, the log 
split is distributed. If we're going to have 1000 logs, we need to distribute, 
or at least multithread, the splitting.
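
To make the multithreaded option concrete, here is a minimal sketch, not 
HBase code. It assumes each log can be read as an ordered list of edits and 
that every edit names its region; the class, field and method names 
(ParallelLogSplitSketch, Edit, splitOne, splitAll) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these types are HBase classes.
class ParallelLogSplitSketch {

    // One logged edit: the region it belongs to plus an opaque payload.
    static final class Edit {
        final String region;
        final String payload;

        Edit(String region, String payload) {
            this.region = region;
            this.payload = payload;
        }
    }

    // Split a single log: group its edits by region, preserving log order.
    static Map<String, List<Edit>> splitOne(List<Edit> log) {
        Map<String, List<Edit>> byRegion = new TreeMap<>();
        for (Edit e : log) {
            byRegion.computeIfAbsent(e.region, k -> new ArrayList<>()).add(e);
        }
        return byRegion;
    }

    // Split every log on a fixed-size thread pool. Each log produces its own
    // per-region output, returned in the same order as the input logs.
    static List<Map<String, List<Edit>>> splitAll(List<List<Edit>> logs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, List<Edit>>>> futures = new ArrayList<>();
            for (List<Edit> log : logs) {
                futures.add(pool.submit(() -> splitOne(log)));
            }
            List<Map<String, List<Edit>>> outputs = new ArrayList<>();
            for (Future<Map<String, List<Edit>>> f : futures) {
                outputs.add(f.get());
            }
            return outputs;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while splitting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a split task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each log is split independently, the same splitOne step could in 
principle run as a map task instead of on a local thread pool.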

1. As is, a region starting up expects to find only one reconstruction log.  
We need to make it so a region can pick up a bunch of edit logs, and it should 
be fine that those logs are elsewhere in HDFS, in an output directory written 
by all split participants, whether they are threads or a mapreduce-like 
distributed process.  (Let's write our distributed sort first as an MR job so 
we learn what's involved; the distributed sort should use MR framework pieces 
as much as possible.)  On startup, regions go to this directory and pick up 
the files written by split participants, deleting the files and clearing the 
dir when all have been read in.  Taking multiple logs as input can also make 
the split process more robust than the current tenuous process, which loses 
all edits if it doesn't make it to the end without error.
2. Each column family rereads the reconstruction log to find its edits.  Need 
to fix that.  The split can sort the edits by column family so each store 
reads only its own edits.
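
The two points above can be sketched together as follows. This is an 
illustration under stated assumptions, not HBase code: it assumes a region 
receives several split files, that each edit carries its column family, and 
that edits have a sequence id that fixes replay order (the sequence id is an 
assumption of this sketch). RegionReplaySketch, Edit and groupByFamily are 
hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: these types are illustrative, not HBase classes.
class RegionReplaySketch {

    // One edit destined for a single column family of the region.
    static final class Edit {
        final long seqId;      // assumed ordering key for replay
        final String family;   // column family the edit belongs to
        final String payload;

        Edit(long seqId, String family, String payload) {
            this.seqId = seqId;
            this.family = family;
            this.payload = payload;
        }
    }

    // Merge the edits from every split file written for one region and
    // bucket them by column family, each bucket ordered by sequence id.
    static Map<String, List<Edit>> groupByFamily(List<List<Edit>> splitFiles) {
        Map<String, List<Edit>> byFamily = new TreeMap<>();
        for (List<Edit> file : splitFiles) {
            for (Edit e : file) {
                byFamily.computeIfAbsent(e.family, k -> new ArrayList<>()).add(e);
            }
        }
        for (List<Edit> edits : byFamily.values()) {
            edits.sort(Comparator.comparingLong((Edit e) -> e.seqId));
        }
        return byFamily;
    }
}
```

With the edits bucketed this way, a store would look up only its own 
family's list instead of scanning every edit for the region.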

Too much work involved here to make it into 0.19.  Moving it out.

> [performance] The replay of logs on server crash takes way too long
> -------------------------------------------------------------------
>
>                 Key: HBASE-1008
>                 URL: https://issues.apache.org/jira/browse/HBASE-1008
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.20.0
>
>
> Watching recovery from a crash on streamy.com where there were 1048 logs and 
> replay is running at a rate of about 20 seconds each (roughly 5.8 hours in 
> total at that rate).  Meantime these regions are not online.  This is way too 
> long to wait on recovery for a live site.  Marking critical.  Performance 
> related, so it is a priority and goes in 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
