I do not think there is a need for a new API. Take a look at TableSnapshotInputFormat, which you can customize to work with key ranges. It allows M/R over snapshots.
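To illustrate, here is a minimal sketch of a key-range-scoped snapshot job. This is a sketch under assumptions, not code from this thread: the snapshot name, destination table, restore directory, and RangeMapper are placeholders, and it assumes the 0.98+ TableMapReduceUtil.initTableSnapshotMapperJob helper (splits are only generated for regions overlapping the scan's key range).

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class SnapshotRangeCopy {

      // Re-emits every cell of each row as a Put; plug in your own logic.
      public static class RangeMapper
          extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value,
            Context ctx) throws IOException, InterruptedException {
          Put put = new Put(row.get());
          for (Cell cell : value.rawCells()) {
            put.add(cell);
          }
          ctx.write(row, put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "snapshot-range-copy");

        // Limit this run to one key range; regions entirely outside
        // [startKey, stopKey) produce no input splits.
        Scan scan = new Scan()
            .setStartRow(Bytes.toBytes("startKey"))
            .setStopRow(Bytes.toBytes("stopKey"));

        // "my-snapshot" and the restore dir are placeholders; the restore
        // dir must live on the same FileSystem as hbase.rootdir.
        TableMapReduceUtil.initTableSnapshotMapperJob(
            "my-snapshot", scan, RangeMapper.class,
            ImmutableBytesWritable.class, Put.class,
            job, true, new Path("/tmp/snapshot-restore"));

        // Map-only job writing straight into a destination table via
        // TableOutputFormat; HFileOutputFormat2 + bulk load also works.
        TableMapReduceUtil.initTableReducerJob("dest-table", null, job);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

If the destination table is on a remote cluster, initTableReducerJob has an overload that takes the peer quorum address, which is how CopyTable targets another cluster.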
You make a snapshot of the full table, run the first batch of keys through
the M/R job, then delete the snapshot and create a new one ... repeat until
the last key range. You will need to control major compaction during this
migration. How to output the data is your choice: TableOutputFormat or
HFileOutputFormat2. (A rough sketch of this loop follows the quoted message
below.)

-Vladimir Rodionov

On Thu, Feb 12, 2015 at 9:18 AM, rahul gidwani <rahul.gidw...@gmail.com> wrote:

> Before proposing this idea, I would like to state that I have recently had
> a thorough psychiatric evaluation and I'm not crazy.
>
> We here at Flurry land have some very large tables, on the order of 1PB
> (3PB with dfs replication). We wanted to ship this table to another
> cluster using snapshots. The problem is that the data will take weeks to
> ship, and during that time major compaction will happen, so we could end
> up with potentially double the data on our cluster. (We really don't want
> to turn off major compaction, because our reads would really suffer.)
>
> Additionally, there is one really large CF that dominates this table. To
> mitigate this problem, we were thinking that a user could pass in the key
> ranges for a snapshot and we could do the process in batches. This might
> also be useful for sampling data, or for keys based on something like
> timestamps, where you could archive certain portions of data known to be
> stale.
>
> If people are interested, we could get into more details about the
> implementation.
>
> Cheers
> rahul
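And a rough sketch of the snapshot-per-batch driver loop described above, assuming the HBase 1.0 Connection/Admin API (on 0.98 you would use HBaseAdmin directly); the table name, split points, and the runRangeJob hook are all hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedSnapshotShip {

      // Hypothetical hook: submit a snapshot-scoped M/R job over
      // [start, stop), e.g. the TableSnapshotInputFormat job sketched
      // earlier with its scan bounds set from these arguments.
      private static void runRangeJob(Configuration conf, String snapshot,
          byte[] start, byte[] stop) throws Exception {
        // ... TableMapReduceUtil.initTableSnapshotMapperJob(snapshot, ...)
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName table = TableName.valueOf("bigtable");  // placeholder

        // Hypothetical split points; in practice you would derive these
        // from the table's region boundaries.
        byte[][] bounds = {
            HConstants.EMPTY_START_ROW,
            Bytes.toBytes("m"),
            HConstants.EMPTY_END_ROW };

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          for (int i = 0; i + 1 < bounds.length; i++) {
            String snapshot = "ship-batch-" + i;
            // A fresh snapshot per batch, so HFiles compacted away between
            // batches are not pinned in the archive for the whole migration.
            admin.snapshot(snapshot, table);
            try {
              runRangeJob(conf, snapshot, bounds[i], bounds[i + 1]);
            } finally {
              admin.deleteSnapshot(snapshot);
            }
          }
        }
      }
    }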