Thanks for the responses, guys. I've created a ticket for this issue and we
can take it from there:

https://issues.apache.org/jira/browse/HBASE-13031

Maybe we can hash out how to handle the splitting / merging logic there, and
then I can start working on some patches.
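
To make the region-boundary idea concrete, here is a rough, untested sketch
of what the key munging Jesse described could look like from the client side:
expand a user-supplied [startKey, endKey) hint outward to the boundaries of
the regions that overlap it, so the snapshot only ever deals in whole regions.
This assumes the 1.0-style client API; the class name, method name, table
name, and keys are just placeholders, not anything from an actual patch.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class SnapshotRangeHint {

  /**
   * Expand a user-supplied [startKey, endKey) hint outward to the
   * boundaries of the regions that overlap it, so a range-restricted
   * snapshot would only ever capture whole regions.
   */
  public static Pair<byte[], byte[]> mungeToRegionBoundaries(
      RegionLocator locator, byte[] startKey, byte[] endKey)
      throws IOException {
    Pair<byte[][], byte[][]> keys = locator.getStartEndKeys();
    byte[][] starts = keys.getFirst();
    byte[][] ends = keys.getSecond();
    byte[] mungedStart = null;
    byte[] mungedEnd = null;
    for (int i = 0; i < starts.length; i++) {
      boolean lastRegion = ends[i].length == 0;
      // a region overlaps the hint if it ends after startKey and starts before endKey
      boolean endsAfterStart = lastRegion || Bytes.compareTo(ends[i], startKey) > 0;
      boolean startsBeforeEnd =
          endKey.length == 0 || Bytes.compareTo(starts[i], endKey) < 0;
      if (endsAfterStart && startsBeforeEnd) {
        if (mungedStart == null) {
          mungedStart = starts[i];  // start key of the first overlapping region
        }
        mungedEnd = ends[i];        // end key of the last overlapping region seen
      }
    }
    return new Pair<byte[], byte[]>(mungedStart, mungedEnd);
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         RegionLocator locator =
             conn.getRegionLocator(TableName.valueOf("big_table"))) {
      // hypothetical timestamp-prefixed keys, as in the stale-data use case
      Pair<byte[], byte[]> range = mungeToRegionBoundaries(locator,
          Bytes.toBytes("20150101"), Bytes.toBytes("20150201"));
      System.out.println("snapshot this range: ["
          + Bytes.toStringBinary(range.getFirst()) + ", "
          + Bytes.toStringBinary(range.getSecond()) + ")");
    }
  }
}

A region merge between computing these boundaries and actually taking the
snapshot would obviously invalidate them, which I think is the complication
Ted is pointing at, so we'd have to resolve the boundaries inside whatever
procedure takes the snapshot.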





On Thu, Feb 12, 2015 at 9:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. allow keys to be munged to the nearest region boundary.
>
> Interesting idea.
>
> If there is region merge, this may get a bit complicated.
>
> On Thu, Feb 12, 2015 at 9:26 AM, Jesse Yates <jesse.k.ya...@gmail.com>
> wrote:
>
> > Not a crazy idea at all :)
> >
> > It becomes very tractable if you are willing to allow keys to be munged
> > to the nearest region boundary. The snapshot only considers the HFiles
> > in each region and creates links to those files for the snapshot. So
> > just capturing a subset of regions (as dictated by the 'hint' key
> > ranges) would be reasonable.
> >
> > We might need a way to differentiate them from normal snapshots, but
> > maybe not - if you supply key ranges, then it's on you to know what you
> > are doing with that snapshot.
> >
> > Would you ever want to restore only part of a table? I'm not sure that
> > even makes sense... maybe restoring a chunk at a time? If the latter,
> > then we will likely need to change the restore mechanics to make sure
> > it works (but it may just work out of the box, IIRC).
> >
> > bq. we could do the process in batches
> >
> >
> > Would you be willing to manage that yourself, or would you see this as
> > something HBase would manage for you?
> >
> > -------------------
> > Jesse Yates
> > @jesse_yates
> > jyates.github.com
> >
> > On Thu, Feb 12, 2015 at 9:18 AM, rahul gidwani <rahul.gidw...@gmail.com>
> > wrote:
> >
> > > Before proposing this idea, I would like to state that I have recently
> > > had a thorough psychiatric evaluation and I'm not crazy.
> > >
> > > We here at flurry land have some very large tables, on the order of
> > > 1 PB (3 PB with DFS replication). We wanted to ship one such table to
> > > another cluster using snapshots. The problem is that the data will
> > > take weeks to ship, and during that time major compaction will happen
> > > and we will end up with potentially double the data on our cluster.
> > > (We really don't want to turn off major compaction because reads
> > > would really suffer.)
> > >
> > > Additionally, there is one really large CF that dominates this table.
> > > So to mitigate this problem we were thinking that a user could pass
> > > in key ranges for a snapshot and we could do the process in batches.
> > > This might also be useful for sampling data, or for keys based on
> > > something like timestamps, where you could archive certain portions
> > > of data known to be stale.
> > >
> > > If people are interested we could get into more details about
> > > implementation.
> > >
> > > Cheers
> > > rahul
> > >
> >
>
