bq. allow keys to be munged to the nearest region boundary.

Interesting idea.
If there is a region merge, this may get a bit complicated.

On Thu, Feb 12, 2015 at 9:26 AM, Jesse Yates <jesse.k.ya...@gmail.com> wrote:

> Not a crazy idea at all :)
>
> It becomes very tractable if you are willing to allow keys to be munged to
> the nearest region boundary. The snapshot only considers the HFiles in each
> region and creates links to those files for the snapshot. So just capturing
> a subset of regions (as dictated by the 'hint' key ranges) would be
> reasonable.
>
> We might need a way to differentiate them from normal snapshots, but maybe
> not - if you supply key ranges, then it's on you to know what you are doing
> with that snapshot.
>
> Would you ever want to restore only part of a table? I'm not sure that even
> makes sense.... maybe restoring a chunk at a time? If the latter, then we
> will likely need to change the restore mechanics to make sure it works (but
> it may just work out of the box, IIRC).
>
> > we could do the process in batches
>
> Would you be willing to manage that yourself, or would you see this as
> something HBase would manage for you?
>
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>
> On Thu, Feb 12, 2015 at 9:18 AM, rahul gidwani <rahul.gidw...@gmail.com>
> wrote:
>
> > Before proposing this idea, I would like to state I have recently had a
> > thorough psychiatric evaluation and I'm not crazy.
> >
> > We here at flurry land have some very large tables on the order of 1PB, 3PB
> > with dfs replication. We wanted to ship this table to another cluster
> > using snapshots. Problem is that the data will take weeks to ship and
> > during that time major compaction will happen and we will end up with
> > potentially double the data on our cluster. (We really don't want to turn
> > off major compaction because we will really suffer with reads).
> >
> > Additionally there is one really large CF that dominates this table. So to
> > mitigate this problem we were thinking that a user could pass in the key
> > ranges for a snapshot and we could do the process in batches. This might
> > also be useful for sampling data, or keys which are based on something like
> > timestamps, where you could archive certain portions of data known to be
> > stale.
> >
> > If people are interested we could get into more details about
> > implementation.
> >
> > Cheers
> > rahul
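
To make the "munge to the nearest region boundary" idea concrete, here is a rough sketch in Python (not HBase code, and not an existing HBase API). It assumes a table's regions are described by a sorted list of start keys, that the first region starts at the empty key, that the last region is unbounded, and that keys compare as unsigned bytes. The function name `snap_to_regions` and its arguments are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def snap_to_regions(region_start_keys, start_key, stop_key):
    """Widen [start_key, stop_key) to whole regions.

    region_start_keys: sorted list of bytes; region i covers
        [region_start_keys[i], region_start_keys[i + 1]), and the last
        region runs to the end of the table (hypothetical layout).
    start_key: inclusive; b"" means start of table.
    stop_key: exclusive; b"" means end of table.

    Returns (snapped_start, snapped_stop, region_indexes).
    """
    if not region_start_keys or region_start_keys[0] != b"":
        raise ValueError("first region must start at the empty key")
    if stop_key != b"" and stop_key <= start_key:
        raise ValueError("stop_key must sort after start_key")

    # Region that contains start_key: last start key <= start_key.
    first = bisect_right(region_start_keys, start_key) - 1

    # Region that contains the last key strictly below stop_key.
    if stop_key == b"":
        last = len(region_start_keys) - 1
    else:
        last = bisect_left(region_start_keys, stop_key) - 1

    snapped_start = region_start_keys[first]
    # The end of the last covered region is the next region's start key,
    # or b"" (end of table) if it is the final region.
    if last + 1 < len(region_start_keys):
        snapped_stop = region_start_keys[last + 1]
    else:
        snapped_stop = b""

    return snapped_start, snapped_stop, list(range(first, last + 1))


# Four regions: [b"", g), [g, n), [n, t), [t, end of table)
example_starts = [b"", b"g", b"n", b"t"]
print(snap_to_regions(example_starts, b"h", b"p"))
# prints (b'g', b't', [1, 2])
```

A hinted range of `h` to `p` ends up covering the two regions `[g, n)` and `[n, t)`. The sketch also shows why a merge or split is awkward: the boundaries are only valid for the region layout at the moment they were computed, so batches planned against an older layout may overlap or leave gaps.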