[ https://issues.apache.org/jira/browse/HBASE-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karthik Ranganathan updated HBASE-5783:
---------------------------------------
Assignee: Amitanand Aiyer (was: Nicolas Spiegelberg)
> Faster HBase bulk loader
> ------------------------
>
> Key: HBASE-5783
> URL: https://issues.apache.org/jira/browse/HBASE-5783
> Project: HBase
> Issue Type: New Feature
> Components: client, ipc, performance, regionserver
> Reporter: Karthik Ranganathan
> Assignee: Amitanand Aiyer
>
> A (hacky) prototype of this approach shows a 3x to 4x gain over the MR bulk
> loader for very large data sets. The approach is:
> 1. Do direct multi-puts from the HBase client using GZIP-compressed RPCs
> 2. Turn off WAL (we will ensure no data loss in another way)
> 3. For each bulk load client, we need to:
> 3.1 do a put
> 3.2 get back a tracking cookie (memstoreTs or HLogSequenceId) per put
> 3.3 be able to ask the RS if the tracking cookie has been flushed to disk
> 4. Each client (map task) succeeds if the tracking cookie for the last put
> it did (for every RS) has made it to disk. Otherwise the map task fails and is
> retried.
> 5. If the last put has not made it to disk within a timeout (say a second or
> so), we issue a manual flush.
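
Steps 3 to 5 above can be sketched in plain Java. This is only a toy model under stated assumptions, not the prototype's code and not an HBase API: `ToyRegionServer`, `BulkLoadClient` and `awaitDurable` are hypothetical names, a single region server is modelled (the real scheme needs one cookie per RS), and the cookie is a simple increasing sequence number standing in for memstoreTs or HLogSequenceId.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy stand-in for a region server with the WAL turned off (hypothetical, not an HBase class). */
class ToyRegionServer {
    private final AtomicLong nextSeqId = new AtomicLong(0);
    private volatile long flushedUpTo = 0; // highest sequence id that has reached disk

    /** Steps 3.1 and 3.2: accept a put and return its tracking cookie. The data itself is not modelled. */
    long put(String row, String value) {
        return nextSeqId.incrementAndGet();
    }

    /** Step 3.3: has everything up to and including this cookie been flushed? */
    boolean isFlushed(long cookie) {
        return cookie <= flushedUpTo;
    }

    /** Simulated memstore flush: every put accepted so far is now considered on disk. */
    void flush() {
        flushedUpTo = nextSeqId.get();
    }
}

class BulkLoadClient {
    /**
     * Steps 4 and 5: poll until the last put is on disk. If the timeout passes
     * first, issue a manual flush and check once more. A false return means the
     * map task should fail and be retried.
     */
    static boolean awaitDurable(ToyRegionServer rs, long lastCookie, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (rs.isFlushed(lastCookie)) {
                return true;
            }
            try {
                Thread.sleep(10); // back off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // treat interruption as a failed task
            }
        }
        rs.flush(); // step 5: timeout expired, force a flush
        return rs.isFlushed(lastCookie);
    }
}
```

Because cookies only increase, checking the last cookie is enough: if it has been flushed, every earlier put from the same client has been flushed too.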
> Enhancements:
> - Increase the memstore size so that we flush larger files
> - Decrease the compaction ratios (say increase the number of files to compact)
> Quick background:
> The bottlenecks in the multiput approach are that the data is transferred
> *uncompressed* twice over the top-of-rack: once from the client to the RS (on
> the multi put call) and again because of WAL (HDFS replication). We reduced
> the former with RPC compression and eliminated the latter above while still
> guaranteeing that data won't be lost.
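
The first of those two savings, compressing the multi-put payload before it crosses the top-of-rack switch, can be illustrated with the JDK's own GZIP streams. This is a minimal sketch rather than the RPC compression used in the prototype: `GzipPayload` is a hypothetical helper, and the payload is treated as an opaque byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: GZIP a serialized batch of puts before sending it. */
class GzipPayload {
    /** Client side: compress the serialized batch. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The GZIP stream is closed here, so the trailer has been written.
        return out.toByteArray();
    }

    /** Server side: restore the original bytes. */
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

How much this saves depends on the data; batches with repeated row-key prefixes and column names tend to compress well, while already-compressed values will not shrink.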
> This is better than the MR bulk loader at a high level because we don't need
> to merge sort all the files for a given region and then make it an HFile -
> that's the equivalent of bulk loading AND major compacting in one shot. Also
> there is much more disk involved in the MR method (sort/spill).