[jira] [Updated] (HBASE-5783) Faster HBase bulk loader

Karthik Ranganathan (JIRA) Tue, 18 Sep 2012 10:52:11 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Karthik Ranganathan updated HBASE-5783:
---------------------------------------

    Assignee: Amitanand Aiyer  (was: Nicolas Spiegelberg)
    
> Faster HBase bulk loader
> ------------------------
>
>                 Key: HBASE-5783
>                 URL: https://issues.apache.org/jira/browse/HBASE-5783
>             Project: HBase
>          Issue Type: New Feature
>          Components: client, ipc, performance, regionserver
>            Reporter: Karthik Ranganathan
>            Assignee: Amitanand Aiyer
>
> A prototype that (hackily) demonstrates this approach shows a 3x to 4x gain 
> over the MR bulk loader for very large data sets. It does the following:
> 1. Do direct multi-puts from the HBase client using GZIP-compressed RPCs
> 2. Turn off WAL (we will ensure no data loss in another way)
> 3. For each bulk load client, we need to:
> 3.1 do a put
> 3.2 get back a tracking cookie (memstoreTs or HLogSequenceId) per put
> 3.3 be able to ask the RS if the tracking cookie has been flushed to disk
> 4. Each client succeeds if the tracking cookie for the last put it did (for 
> every RS) makes it to disk. Otherwise the map task fails and is retried.
> 5. If the last put has not made it to disk within a timeout (say a second or 
> so), we issue a manual flush.
> Enhancements:
> - Increase the memstore size so that we flush larger files
> - Decrease the compaction ratios (say, by increasing the number of files to compact)
> Quick background:
> The bottleneck in the multi-put approach is that the data is transferred 
> *uncompressed* twice over the top-of-rack switch: once from the client to the 
> RS (on the multi-put call) and again because of the WAL (HDFS replication). 
> We reduced the former with RPC compression and eliminated the latter as 
> described above, while still guaranteeing that data won't be lost.
> This is better than the MR bulk loader at a high level because we don't need 
> to merge sort all the files for a given region and then make them an HFile - 
> that's the equivalent of bulk loading AND major compacting in one shot. Also, 
> there is much more disk I/O involved in the MR method (sort/spill).
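
Steps 3 to 5 of the description above can be sketched as a small in-memory
simulation. This is only an illustration under stated assumptions, not the
prototype's code and not an HBase API: FakeRegionServer, send_batches,
confirm_durable, and all of their parameters are hypothetical names, the
tracking cookie is modelled as a simple per-server counter, and a crash is
only simulated after all puts have been sent.

```python
import time


class FakeRegionServer:
    """Hypothetical in-memory stand-in for a region server with the WAL off."""

    def __init__(self):
        self.next_seq = 1        # next tracking cookie to hand out
        self.flushed_up_to = 0   # highest cookie known to be on disk
        self.memstore = []       # (cookie, row) pairs not yet flushed

    def multi_put(self, rows):
        """Buffer rows (no WAL) and return the cookie of the last row (3.1, 3.2)."""
        for row in rows:
            self.memstore.append((self.next_seq, row))
            self.next_seq += 1
        return self.next_seq - 1

    def is_flushed(self, cookie):
        """Answer whether the given cookie has reached disk (3.3)."""
        return cookie <= self.flushed_up_to

    def flush(self):
        """Simulate a memstore flush: everything buffered becomes durable."""
        if self.memstore:
            self.flushed_up_to = self.memstore[-1][0]
            self.memstore.clear()

    def crash(self):
        """Simulate a restart: unflushed data is lost because there is no WAL."""
        self.memstore.clear()


def send_batches(work):
    """Send every batch and remember the last cookie per server.

    `work` is a list of (server, batches) pairs, where each batch is a
    non-empty list of rows.
    """
    last_cookies = []
    for server, batches in work:
        cookie = None
        for batch in batches:
            cookie = server.multi_put(batch)
        if cookie is not None:
            last_cookies.append((server, cookie))
    return last_cookies


def confirm_durable(last_cookies, flush_timeout_s=1.0, poll_s=0.01):
    """Return True only if the last put on every server made it to disk."""
    for server, cookie in last_cookies:
        deadline = time.monotonic() + flush_timeout_s
        while not server.is_flushed(cookie) and time.monotonic() < deadline:
            time.sleep(poll_s)
        if not server.is_flushed(cookie):
            server.flush()       # step 5: manual flush after the timeout
        if not server.is_flushed(cookie):
            return False         # step 4: the map task fails and is retried
    return True


if __name__ == "__main__":
    rs = FakeRegionServer()
    cookies = send_batches([(rs, [["row1", "row2"], ["row3"]])])
    print("load succeeded:", confirm_durable(cookies, flush_timeout_s=0.1))
```

In this sketch a failed check simply returns False; in the scheme described
above that would correspond to failing the map task so that it is retried.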

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
