Thanks for the responses, guys. I've created a ticket for this issue and we can take it from there:
https://issues.apache.org/jira/browse/HBASE-13031

Maybe we can hash out how to handle the splitting / merging logic there and I
can start working on some patches.

On Thu, Feb 12, 2015 at 9:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. allow keys to be munged to the nearest region boundary.
>
> Interesting idea.
>
> If there is a region merge, this may get a bit complicated.
>
> On Thu, Feb 12, 2015 at 9:26 AM, Jesse Yates <jesse.k.ya...@gmail.com>
> wrote:
>
> > Not a crazy idea at all :)
> >
> > It becomes very tractable if you are willing to allow keys to be munged
> > to the nearest region boundary. The snapshot only considers the HFiles
> > in each region and creates links to those files for the snapshot. So
> > just capturing a subset of regions (as dictated by the 'hint' key
> > ranges) would be reasonable.
> >
> > We might need a way to differentiate them from normal snapshots, but
> > maybe not - if you supply key ranges, then it's on you to know what you
> > are doing with that snapshot.
> >
> > Would you ever want to restore only part of a table? I'm not sure that
> > even makes sense... maybe restoring a chunk at a time? If the latter,
> > then we will likely need to change the restore mechanics to make sure
> > it works (but it may just work out of the box, IIRC).
> >
> > > we could do the process in batches
> >
> > Would you be willing to manage that yourself, or would you see this as
> > something HBase would manage for you?
> >
> > -------------------
> > Jesse Yates
> > @jesse_yates
> > jyates.github.com
> >
> > On Thu, Feb 12, 2015 at 9:18 AM, rahul gidwani <rahul.gidw...@gmail.com>
> > wrote:
> >
> > > Before proposing this idea, I would like to state that I have
> > > recently had a thorough psychiatric evaluation and I'm not crazy.
> > >
> > > We here at Flurry land have some very large tables, on the order of
> > > 1PB (3PB with DFS replication). We wanted to ship one such table to
> > > another cluster using snapshots. The problem is that the data will
> > > take weeks to ship, and during that time major compaction will run,
> > > so we could end up with potentially double the data on our cluster.
> > > (We really don't want to turn off major compaction because reads
> > > would suffer badly.)
> > >
> > > Additionally, there is one really large CF that dominates this table.
> > > To mitigate this problem, we were thinking that a user could pass in
> > > the key ranges for a snapshot and we could do the process in batches.
> > > This might also be useful for sampling data, or for keys based on
> > > something like timestamps, where you could archive certain portions
> > > of data known to be stale.
> > >
> > > If people are interested we could get into more details about
> > > implementation.
> > >
> > > Cheers
> > > rahul
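
To make the boundary-munging idea concrete before we get into the ticket,
here is a minimal sketch of the region-selection step: given a requested
[startKey, endKey) range, find the regions that overlap it, and the
effective snapshot range snaps outward to those regions' boundaries. The
HBaseAdmin.getTableRegions(), HRegionInfo.getStartKey()/getEndKey(), and
Bytes.compareTo() calls are the real (circa 0.98/1.0) client API, but
regionsForRange itself, the table name, and the example keys are all
hypothetical illustration, not an existing HBase API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeSnapshotSketch {

  /**
   * Return the regions overlapping [startKey, endKey). A range-aware
   * snapshot would link only these regions' HFiles; the requested keys are
   * effectively "munged" to the first region's start key and the last
   * region's end key.
   */
  static List<HRegionInfo> regionsForRange(HBaseAdmin admin, TableName table,
      byte[] startKey, byte[] endKey) throws IOException {
    List<HRegionInfo> selected = new ArrayList<HRegionInfo>();
    for (HRegionInfo region : admin.getTableRegions(table)) {
      byte[] rStart = region.getStartKey();
      byte[] rEnd = region.getEndKey();
      // A region misses the range only if it ends at or before startKey,
      // or starts at or after endKey. An empty byte[] boundary means the
      // unbounded first/last region.
      boolean endsBeforeRange =
          rEnd.length > 0 && Bytes.compareTo(rEnd, startKey) <= 0;
      boolean startsAfterRange =
          endKey.length > 0 && Bytes.compareTo(rStart, endKey) >= 0;
      if (!endsBeforeRange && !startsAfterRange) {
        selected.add(region);
      }
    }
    return selected;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Hypothetical table and time-bucketed keys, per the archival use case.
      TableName table = TableName.valueOf("bigtable");
      List<HRegionInfo> regions = regionsForRange(admin, table,
          Bytes.toBytes("key-2015-01"), Bytes.toBytes("key-2015-02"));
      // getTableRegions returns regions in key order, so the snapped-to
      // boundaries are the first start key and the last end key.
      byte[] snappedStart = regions.get(0).getStartKey();
      byte[] snappedEnd = regions.get(regions.size() - 1).getEndKey();
      System.out.println("Snapshot would cover ["
          + Bytes.toStringBinary(snappedStart) + ", "
          + Bytes.toStringBinary(snappedEnd) + ")");
    } finally {
      admin.close();
    }
  }
}

Batching would then just mean calling this once per key range and taking one
(subset) snapshot per batch; Ted's point stands that a region merge between
batches could shift the snapped boundaries, so that needs handling in the
splitting / merging logic on the ticket.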