[
https://issues.apache.org/jira/browse/HBASE-13042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320618#comment-14320618
]
Enis Soztutar commented on HBASE-13042:
---------------------------------------
Here is an idea; not sure whether it will help you.
{{TableSnapshotInputFormat}} allows you to run any MR job directly over a
snapshot. It also accepts key ranges, and eliminates regions outside the range.
Without HBASE-13031 a snapshot is still a full-table snapshot, but what you can
do is:
Decide on table key ranges (let's say N ranges):
{code}
for i in 0..N-1:
  (1) take a snapshot
  (2) run a custom MR job over the snapshot to export the data
      (create HFiles for bulk load) for Range[i]
  (3) delete the snapshot
{code}
You will only hold onto a single snapshot during (2), and you can control how
long that step takes by choosing the size of Range[i].
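The loop above can be sketched as follows. This is a control-flow sketch only: the three helpers are caller-supplied placeholders for the real operations (taking the HBase snapshot, running the custom MR export job over it, and deleting it) — none of these names come from the HBase API.

```python
# Sketch of the per-range export loop. Each range gets its own short-lived
# snapshot, which is deleted as soon as its export job finishes (or fails).
def export_table_in_ranges(ranges, take_snapshot, export_range, delete_snapshot):
    """Export each key range under its own short-lived snapshot."""
    for i, key_range in enumerate(ranges):
        name = "export-snap-%d" % i
        take_snapshot(name)                 # (1) full-table snapshot
        try:
            export_range(name, key_range)   # (2) MR job restricted to Range[i]
        finally:
            delete_snapshot(name)           # (3) held only for the duration of (2)
```

The `try/finally` mirrors the point of the scheme: storage overhead is bounded by one snapshot at a time, even if an export attempt fails.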
> MR Job to export HFiles directly from an online cluster
> -------------------------------------------------------
>
> Key: HBASE-13042
> URL: https://issues.apache.org/jira/browse/HBASE-13042
> Project: HBase
> Issue Type: New Feature
> Reporter: Dave Latham
>
> We're looking at the best way to bootstrap a new remote cluster. The source
> cluster has a large table of compressed data using more than 50% of the
> HDFS capacity and we have a WAN link to the remote cluster. Ideally we would
> set up replication to a new table remotely, snapshot the source table, copy
> the snapshot across, then bulk load it into the new table. However the
> amount of time to copy the data remotely is greater than the major compaction
> interval so the source cluster would run out of storage.
> One approach is HBASE-13031 to allow the operators to snapshot and copy one
> key range at a time. Here's another idea:
> Create a MR job that tries to do a robust remote HFile copy directly:
> * Each split is responsible for a key range.
> * Each map task looks up its key range and maps it to a set of HDFS store
> directories (one per region/column family)
> * For each store:
> ** List HFiles in store (needs to be less than 1000 files to guarantee
> atomic listing)
> ** Attempt to copy store files (copy in increasing size order to minimize
> likelihood of compaction removing a file during copy)
> ** If some of the files disappear (compaction), retry directory list / copy
> * If any of the stores disappear (region split / merge) then retry map task
> (and remap key range to stores)
> Or maybe there are some HBase locking mechanisms for a region or store that
> would be better. Otherwise the question is how often would compactions or
> region splits force retries.
> Is this crazy?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)