[jira] [Commented] (HBASE-5604) HLog replay tool that generates HFiles for use by LoadIncrementalHFiles.

Jonathan Hsieh (Commented) (JIRA) Thu, 22 Mar 2012 00:32:49 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235420#comment-13235420
 ]


Jonathan Hsieh commented on HBASE-5604:
---------------------------------------


What is the use case for filtering the HLogs?  Would you want to do a "partial" 
recovery?

I've encountered a situation where an entire large cluster went down and every 
RS's WAL needed log splitting.  The NameNode went down underneath HBase.  Since 
this was before distributed log splitting, it took overnight to restore.  
Distributed log splitting would have really helped here (roughly divide by 100), 
but it's not clear whether there was enough data for the bulk load to overcome 
the extra writes required by an MR job.

I'd guess that distributed log splitting is probably faster.  With an MR job 
you'd potentially need to materialize after the map, do a shuffle (needed?), and 
materialize again after the reduce before bulk loading (which may split the 
generated HFiles), so there are multiple writes per put/delete.  Distributed log 
splitting, assuming there are no WAL writes on replay, may not incur any disk 
cost except for regular memstore flushes (which means a single write per 
put/delete).  
                
> HLog replay tool that generates HFiles for use by LoadIncrementalHFiles.
> ------------------------------------------------------------------------
>
>                 Key: HBASE-5604
>                 URL: https://issues.apache.org/jira/browse/HBASE-5604
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Lars Hofhansl
>
> Just an idea I had. Might be useful for restore of a backup using the HLogs.
> This could be an M/R job (with a mapper per HLog file).
> The tool would get a timerange and a (set of) table(s). We'd pick the right 
> HLogs based on time before the M/R job is started and then have a mapper per 
> HLog file.
> The mapper would then go through the HLog, filter out all WALEdits that don't 
> fall into the time range or don't belong to any of the tables, and then use 
> HFileOutputFormat to generate HFiles.
> Would need to indicate the splits we want, probably from a live table.
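
The filtering step in the quoted proposal could look roughly like the sketch 
below.  It is a self-contained illustration, not HBase code: WalEntry and 
WalEntryFilter are hypothetical stand-ins for the real HLog entry types, and the 
half-open time range [minTs, maxTs) is an assumption.  A real mapper would read 
entries from an HLog file and emit the surviving edits through 
HFileOutputFormat instead of collecting them in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for one edit read from an HLog; not an HBase class. */
final class WalEntry {
    final String table;
    final long timestamp;
    final String rowKey;

    WalEntry(String table, long timestamp, String rowKey) {
        this.table = table;
        this.timestamp = timestamp;
        this.rowKey = rowKey;
    }
}

/** Keeps only entries for the requested tables inside the requested time range. */
final class WalEntryFilter {
    private final Set<String> tables;
    private final long minTs; // inclusive (assumption)
    private final long maxTs; // exclusive (assumption)

    WalEntryFilter(Set<String> tables, long minTs, long maxTs) {
        if (minTs > maxTs) {
            throw new IllegalArgumentException("minTs must not exceed maxTs");
        }
        this.tables = tables;
        this.minTs = minTs;
        this.maxTs = maxTs;
    }

    /** True if the entry belongs to a requested table and falls in the range. */
    boolean accepts(WalEntry e) {
        return tables.contains(e.table)
                && e.timestamp >= minTs
                && e.timestamp < maxTs;
    }

    /** What a mapper would do per HLog file: drop every non-matching entry. */
    List<WalEntry> filter(List<WalEntry> entries) {
        List<WalEntry> kept = new ArrayList<>();
        for (WalEntry e : entries) {
            if (accepts(e)) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

Because the check is per entry and stateless, one filter instance per mapper 
would be enough, which fits the "mapper per HLog file" layout in the proposal.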

