I do not think there is a need for a new API.

Take a look at TableSnapshotInputFormat, which you can customize to work
with key ranges.
It allows M/R over snapshots.

You make a snapshot of the full table, then you run the first batch of keys
in an M/R job,
then you delete the snapshot and create a new one ...
repeat until the last key range.
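Per batch it would look roughly like this (just a sketch -- the snapshot
name, key range, and paths are placeholders, and I am assuming the stock
TableMapReduceUtil.initTableSnapshotMapperJob API):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotBatchCopy {

  // Turns each row scanned from the snapshot into a Put for the output format.
  public static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(row.get());
      for (Cell cell : value.rawCells()) {
        put.add(cell);
      }
      context.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "snapshot-batch-1");
    job.setJarByClass(SnapshotBatchCopy.class);

    // Restrict the scan to this batch's key range (start inclusive, stop exclusive).
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("batch1-start"));  // placeholder start key
    scan.setStopRow(Bytes.toBytes("batch1-stop"));    // placeholder stop key

    // Reads the snapshot's HFiles directly from HDFS, bypassing the region servers.
    // The last argument is a scratch dir the snapshot is restored into.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "my_table_snapshot_1",        // snapshot taken for this batch
        scan,
        CopyMapper.class,
        ImmutableBytesWritable.class,
        Put.class,
        job,
        true,
        new Path("/tmp/snapshot_restore"));

    // ... configure the output format (see the note below) and submit ...
    // job.waitForCompletion(true);
  }
}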

You will need to control major compaction during this migration.
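For example, one option is to set hbase.hregion.majorcompaction to 0 for the
duration so the periodic major compactions stop firing, and trigger them
manually (major_compact from the hbase shell) only after a batch has shipped
and its snapshot has been deleted.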

How to output the data is your choice: TableOutputFormat or
HFileOutputFormat2.
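If you go the HFileOutputFormat2 route, the incremental-load setup that slots
into the job above would look something like this (again a sketch, assuming
the 1.x Connection API; 'dest_table' and the output path are placeholders).
The generated HFiles are then loaded on the destination cluster with
completebulkload:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

try (Connection conn = ConnectionFactory.createConnection(conf);
     Table table = conn.getTable(TableName.valueOf("dest_table"));
     RegionLocator locator = conn.getRegionLocator(TableName.valueOf("dest_table"))) {
  // Partitions the reducer output to line up with the destination table's regions;
  // with Put map output values this picks the PutSortReducer automatically.
  HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
  FileOutputFormat.setOutputPath(job, new Path("/tmp/batch1_hfiles"));
}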

-Vladimir Rodionov

On Thu, Feb 12, 2015 at 9:18 AM, rahul gidwani <rahul.gidw...@gmail.com>
wrote:

> Before proposing this idea, I would like to state I have recently had a
> thorough psychiatric evaluation and I'm not crazy.
>
> We here at flurry land have some very large tables, on the order of 1PB (3PB
> with dfs replication).  We wanted to ship this table to another cluster
> using snapshots.  The problem is that the data will take weeks to ship, and
> during that time major compaction will happen and we will end up with
> potentially double the data on our cluster.  (We really don't want to turn
> off major compaction because we will really suffer with reads.)
>
> Additionally, there is one really large CF that dominates this table.  So to
> mitigate this problem we were thinking that a user could pass in the key
> ranges for a snapshot and we could do the process in batches.  This might
> also be useful for sampling data, or for keys based on something like
> timestamps, where you could archive certain portions of data known to be
> stale.
>
> If people are interested we could get into more details about
> implementation.
>
> Cheers
> rahul
>
