Before proposing this idea, I would like to state that I have recently had a
thorough psychiatric evaluation and I'm not crazy.

We here at flurry land have some very large tables, on the order of 1 PB
(3 PB with DFS replication).  We wanted to ship one of these tables to
another cluster using snapshots.  The problem is that the data will take
weeks to ship, and during that time major compaction will run; because the
snapshot keeps references to the old HFiles while compaction rewrites them,
we could end up with potentially double the data on our cluster.  (We
really don't want to turn off major compaction, because reads would suffer
badly.)
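
For context, the shipping we have in mind is the usual snapshot-then-
ExportSnapshot flow, roughly like the following (table name, destination
address, and mapper count are just placeholders):

  hbase shell> snapshot 'big_table', 'big_table_snap'

  hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot big_table_snap \
    -copy-to hdfs://other-cluster:8020/hbase \
    -mappers 16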

Additionally, there is one really large CF that dominates this table.  To
mitigate this problem, we were thinking that a user could pass in key
ranges for a snapshot, so we could do the whole process in batches (a rough
sketch of what that might look like is below).  This might also be useful
for sampling data, or for keys based on something like timestamps, where
you could archive portions of the data known to be stale.
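
To make it concrete, here is the kind of thing we have in mind.  None of
this syntax exists today; it is purely illustrative, and the option names
just mirror the existing scan options:

  # today: all-or-nothing
  snapshot 'big_table', 'big_table_snap'

  # proposed (hypothetical): limit each snapshot to a key range, ship that
  # batch, delete the snapshot, then move on to the next range
  snapshot 'big_table', 'big_table_a_to_m', { STARTROW => 'a', STOPROW => 'm' }
  snapshot 'big_table', 'big_table_m_to_z', { STARTROW => 'm', STOPROW => 'z' }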

If people are interested, we could get into more detail about the
implementation.

Cheers
rahul
