[
https://issues.apache.org/jira/browse/HBASE-13042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
churro morales resolved HBASE-13042.
------------------------------------
Resolution: Fixed
> MR Job to export HFiles directly from an online cluster
> -------------------------------------------------------
>
> Key: HBASE-13042
> URL: https://issues.apache.org/jira/browse/HBASE-13042
> Project: HBase
> Issue Type: New Feature
> Reporter: Dave Latham
>
> We're looking at the best way to bootstrap a new remote cluster. The source
> cluster has a large table of compressed data using more than 50% of the
> HDFS capacity and we have a WAN link to the remote cluster. Ideally we would
> set up replication to a new table remotely, snapshot the source table, copy
> the snapshot across, then bulk load it into the new table. However the
> amount of time to copy the data remotely is greater than the major compaction
> interval so the source cluster would run out of storage.
> One approach is HBASE-13031 to allow the operators to snapshot and copy one
> key range at a time. Here's another idea:
> Create a MR job that tries to do a robust remote HFile copy directly:
> * Each split is responsible for a key range.
> * Map task looks up that key range and maps it to a set of HDFS store
> directories (one for each region/family)
> * For each store:
> ** List HFiles in store (needs to be less than 1000 files to guarantee
> atomic listing)
> ** Attempt to copy store files (copy in increasing size order to minimize
> likelihood of compaction removing a file during copy)
> ** If some of the files disappear (compaction), retry directory list / copy
> * If any of the stores disappear (region split / merge) then retry map task
> (and remap key range to stores)
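As a rough illustration, the per-store copy loop above (list, sort by increasing size, copy, retry if a file disappears) might look something like the following sketch. This is only a local-filesystem mock-up using `java.nio.file` in place of the real HDFS `FileSystem` API; the class name `StoreCopySketch`, the retry count, and the use of `NoSuchFileException` as the "compaction removed a file" signal are all illustrative assumptions, not anything from the actual patch.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Sketch of the per-store copy loop described above, using java.nio.file
// as a stand-in for the HDFS FileSystem API (illustrative only).
public class StoreCopySketch {
    static final int MAX_RETRIES = 3; // assumed retry budget

    // Copy every file in srcStore to dstStore, smallest first, retrying the
    // whole listing if a file vanishes mid-copy (e.g. compaction removed it).
    static void copyStore(Path srcStore, Path dstStore) throws IOException {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            // List and sort by increasing size to minimize the window in
            // which a compaction can delete a not-yet-copied file.
            List<Path> files;
            try (Stream<Path> s = Files.list(srcStore)) {
                files = s.sorted(Comparator.comparingLong(p -> {
                    try { return Files.size(p); }
                    catch (IOException e) { return Long.MAX_VALUE; }
                })).collect(Collectors.toList());
            }
            try {
                for (Path f : files) {
                    Files.copy(f, dstStore.resolve(f.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                }
                return; // all store files copied successfully
            } catch (NoSuchFileException gone) {
                // A file disappeared mid-copy (compaction); retry the listing.
            }
        }
        throw new IOException("store kept changing: " + srcStore);
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("src");
        Path dst = Files.createTempDirectory("dst");
        Files.write(src.resolve("hfile-a"), new byte[100]);
        Files.write(src.resolve("hfile-b"), new byte[10]);
        copyStore(src, dst);
        System.out.println(Files.list(dst).count()); // prints 2
    }
}
```

In the real job the same retry shape would presumably apply one level up as well: if the store directory itself disappears (region split/merge), the map task fails and re-resolves its key range to a fresh set of store directories.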
> Or maybe there are some HBase locking mechanisms for a region or store that
> would be better. Otherwise the question is how often compactions or
> region splits would force retries.
> Is this crazy?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)