[ https://issues.apache.org/jira/browse/HBASE-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lars Hofhansl updated HBASE-3936:
---------------------------------
Fix Version/s: 0.96.0 (was: 0.94.0)
> Incremental bulk load support for Increments
> --------------------------------------------
>
> Key: HBASE-3936
> URL: https://issues.apache.org/jira/browse/HBASE-3936
> Project: HBase
> Issue Type: Improvement
> Reporter: Andrew Purtell
> Assignee: Andrew Purtell
> Fix For: 0.96.0
>
>
> From http://hbase.apache.org/bulk-loads.html: "The bulk load feature uses a
> MapReduce job to output table data in HBase's internal data format, and then
> directly loads the data files into a running cluster. Using bulk load will
> use less CPU and network than going via the HBase API."
> I have been working with a specific implementation of a class of applications
> that reduce data into a large collection of counters, perhaps building
> projections of the data in many dimensions in the process, and I can envision
> others like it.
> One can use Hadoop MapReduce as the engine to accomplish this for a given
> data set and use LoadIncrementalHFiles to move the result into place for live
> serving. MR is natural for summation over very large counter sets: emit
> counter increments for the data set and projections thereof in mappers, use
> combiners for partial aggregation, use reducers to do final summation into
> HFiles.
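The summation flow described above can be sketched without any Hadoop classes. This is only an illustration under stated assumptions: the class name CounterSummation, the comma-separated record format, and the counter names are all hypothetical, and plain Java maps stand in for mappers, combiners, and reducers. The final totals are kept in a sorted map because HFiles store keys in sorted order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the mapper -> combiner -> reducer flow.
// No Hadoop or HBase classes are used; this only shows the arithmetic.
class CounterSummation {

    // "Map" plus "combine" for one input split: emit an increment of 1 for
    // every counter named in a record, and pre-aggregate within the split.
    // The comma-separated record format is an assumption for illustration.
    static Map<String, Long> mapAndCombine(List<String> records) {
        Map<String, Long> partial = new HashMap<>();
        for (String record : records) {
            for (String counter : record.split(",")) {
                String key = counter.trim();
                if (!key.isEmpty()) {
                    partial.merge(key, 1L, Long::sum);
                }
            }
        }
        return partial;
    }

    // "Reduce": sum the partial aggregates of all splits into final totals.
    // A TreeMap keeps keys sorted, as a writer of HFiles would require.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> split1 = mapAndCombine(
                Arrays.asList("page:/a,country:US", "page:/a,country:DE"));
        Map<String, Long> split2 = mapAndCombine(Arrays.asList("page:/a"));
        System.out.println(reduce(Arrays.asList(split1, split2)));
    }
}
```

In a real job the reducer output would be written through the HFile output path and moved into place with LoadIncrementalHFiles; that part is omitted here.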
> However, it is not possible to then merge in a set of updates to an existing
> table built in the manner above without either 1) joining the table data and
> the update set into a large MR temporary set, followed by a complete rewrite
> of the table; or 2) posting all of the updates as Increments via the HBase
> API, impacting any other concurrent users of the HBase service, and perhaps
> taking 10-100 times longer than if updates could be computed directly into
> HFiles like the original import. Both of these alternatives are expensive in
> terms of CPU and time; the first is also expensive in terms of disk.
> I propose adding incremental bulk load support for Increments. Here is a
> sketch of a possible implementation:
> * Add a KV type for Increment
> * Modify HFile main, LoadIncrementalHFiles, and others that work with HFiles
> directly to handle the new KV type
> * Bulk load API can move the files to be merged into the Stores as before.
> * Implement an alternate compaction algorithm or modify the existing one. It
> needs to identify Increments and apply them to the most recent existing
> version of a value, or create the value if it does not exist.
> ** Use KeyValueHeap as is to merge value-sets by row as before.
> ** For each row, use a KV-keyed Map for in memory update of values.
> ** If there is an existing value and it is not a serialized long, ignore
> the Increment and log at INFO level.
> ** Use the persistent HashMapWrapper from Hive's CommonJoinOperator, with
> an appropriate memory limit, so that work for overlarge rows spills to disk.
> The spill target can be local disk rather than HDFS.
> * Never return an Increment KV to a client doing a Get or Scan.
> ** Before the merge is complete, if we find an Increment KV when searching
> Store files for a value, continue searching back in the Store files until we
> find a Put KV for the value, adding up Increments as they are encountered,
> then applying them to the Put value; or until search ends, in which case the
> Increment is treated as a Put.
> ** If there is an existing value and it is not a serialized long, ignore
> the Increment and log at INFO level.
> * As a beneficial side effect, with Increments as just another KV type we can
> unify Put and Increment handling.
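The read-path rule in the sketch above (walk back through the versions of a value, add up Increments, apply them to the first Put found, or treat the sum as a Put if none is found) might look roughly like the following. This is a minimal sketch, not HBase code: IncrementResolver, Cell, and Type are hypothetical names, a real implementation would work on KeyValues from the Store files, and "is a serialized long" is assumed here to mean "is exactly 8 bytes". The same folding step could in principle serve the compaction merge.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of folding Increment KVs into a base Put value.
class IncrementResolver {

    enum Type { PUT, INCREMENT }

    // Simplified stand-in for one version of one column's KeyValue.
    static final class Cell {
        final Type type;
        final byte[] value;

        Cell(Type type, byte[] value) {
            this.type = type;
            this.value = value;
        }

        static Cell put(long v) { return new Cell(Type.PUT, encode(v)); }
        static Cell increment(long delta) { return new Cell(Type.INCREMENT, encode(delta)); }
    }

    static byte[] encode(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    static long decode(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    // newestFirst holds the versions of a single value, newest first, in the
    // order a search back through the Store files would encounter them.
    // Returns the bytes a client should see, or null if there is no value.
    static byte[] resolve(List<Cell> newestFirst) {
        long pending = 0;
        boolean sawIncrement = false;
        for (Cell cell : newestFirst) {
            if (cell.type == Type.INCREMENT) {
                pending += decode(cell.value);
                sawIncrement = true;
                continue;
            }
            // First Put found: it is the base value and ends the search.
            if (!sawIncrement) {
                return cell.value;
            }
            if (cell.value.length != Long.BYTES) {
                // Not a serialized long (assumed: not 8 bytes). Ignore the
                // Increments; the proposal would also log this at INFO level.
                return cell.value;
            }
            return encode(decode(cell.value) + pending);
        }
        // Search ended without a Put: the summed Increments act as a Put.
        return sawIncrement ? encode(pending) : null;
    }

    public static void main(String[] args) {
        byte[] result = resolve(Arrays.asList(
                Cell.increment(5), Cell.increment(2), Cell.put(10)));
        System.out.println(decode(result));
    }
}
```

Because the result is always returned as a plain value, a client doing a Get or Scan never sees an Increment in this model.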
> Because this is a core concern, I'd prefer to discuss this as a possible
> enhancement of core rather than as a Coprocessor-based extension. However, it
> could be possible to implement all but the KV changes within the Coprocessor
> framework.