A different top-level command would be a better approach (even though the implementation can share much of the scan spec parsing code). That said, DUMP would be a better name than BACKUP, since "BACKUP" without a file (just dumping to stdout) would sound strange. Plus, it's shorter :)
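For illustration, such a command might look something like this (hypothetical syntax, just a sketch of how DUMP could reuse the SELECT scan-spec options; nothing here is implemented):

  DUMP TABLE foo [WHERE TIMESTAMP >= '2010-01-18 00:00:00'] [REVS 1] [INTO FILE 'foo.dump.gz'];

With INTO FILE omitted, the cells would just go to stdout.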
On Mon, Jan 18, 2010 at 10:42 PM, Doug Judd <[email protected]> wrote:
> The BACKUP feature is really to allow for the generation of efficient
> backup files. Certain WHERE clauses and options such as ROW, CELL, and
> LIMIT would be incompatible with the BACKUP option, since BACKUP would be
> a completely separate code path and those options don't really jibe with
> the concept of backing up a table. The reason I suggest folding it in with
> SELECT is that some of the other options, such as TIMESTAMP, column
> selection, and REVS, could be useful features of table backup.
>
> The other approach would be to add a top-level BACKUP TABLE command that
> would support a subset of SELECT options appropriate for table backups:
>
> BACKUP TABLE <table> [WHERE <where-clause>] [OPTIONS]
>
> Supported where-clause options:
>   TIMESTAMP
>
> Other supported options:
>   REVS revision_count
>   INTO FILE filename[.gz]
>
> - Doug
>
> On Mon, Jan 18, 2010 at 10:04 PM, Sanjit Jhala <[email protected]> wrote:
>>
>> I assume it will also allow SELECT (list, of, cfs) FROM foo BACKUP INTO
>> FILE "foo-backup.tgz".
>>
>> Also, I'm wondering if the word BACKUP ought to be replaced by something
>> like RANDOM or SHUFFLED, to decouple this change from backups (although I
>> agree that fast restores are the main use case for this feature). So
>> "SELECT * FROM foo SHUFFLED LIMIT=N;" would return N samples across all
>> ranges, and one could additionally choose to store the output of the
>> SELECT into the tgz file for fast restores.
>>
>> -Sanjit
>>
>> On Mon, Jan 18, 2010 at 8:49 PM, Doug Judd <[email protected]> wrote:
>>>
>>> The current method of using SELECT to take table backups causes
>>> efficiency problems during restore. Because the cells are dumped in
>>> order, when it comes time to restore from backup, the data ends up
>>> getting loaded into one range at a time. I propose adding a BACKUP
>>> option to SELECT that would cause the data to get dumped in random
>>> order (uniformly distributed across the key space). This would cause
>>> restores to be parallelized, since ranges distributed across the
>>> cluster would receive updates simultaneously. Here's example syntax:
>>>
>>> SELECT * FROM foo BACKUP INTO FILE "foo-backup.gz";
>>>
>>> I also propose having the BACKUP option force timestamps to be dumped
>>> as well, since this will preserve the table state exactly. Thoughts?
>>>
>>> - Doug
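To make the parallel-restore argument above concrete, here's a toy sketch (Python, not actual Hypertable code; it buffers everything in memory and invents a .tsv layout, which a real streaming dump obviously wouldn't do):

  import random, sys

  def randomized_dump(cells, out):
      # cells: list of (row, column, timestamp, value) tuples.
      # Shuffling gives a uniform ordering over the key space, so a
      # subsequent load sends consecutive updates to ranges scattered
      # across the cluster instead of filling one range at a time.
      random.shuffle(cells)
      for row, column, timestamp, value in cells:
          out.write("%s\t%s\t%s\t%s\n" % (row, column, timestamp, value))

  randomized_dump([("row1", "cf:q", 1263880800, "v1"),
                   ("row2", "cf:q", 1263880801, "v2")], sys.stdout)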
