[
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852393#action_12852393
]
stack commented on HBASE-50:
----------------------------
Let me freshen this issue by synopsizing what has gone before and adding some
new remarks (this issue has been labeled gsoc, so candidates are taking a look
trying to figure out what's involved).
The current state of a table comprises the memstore content of all
currently running regionservers and the content of the
/${hbase.rootdir}/TABLENAME directory in HDFS. There is also metadata --
mainly the current state of its schema and the regions of which it is comprised
-- kept in the .META. "catalog" table.
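To make the above concrete, here is a minimal sketch of the pieces a consistent snapshot would have to capture. All names here (RegionState, TableState, snapshot_inputs) are illustrative stand-ins, not real HBase classes or APIs:

```python
# Hypothetical model of a table's current state, as described above:
# per-region HFiles on HDFS, unflushed memstore edits, and .META. schema.
from dataclasses import dataclass, field

@dataclass
class RegionState:
    region_name: str
    memstore_edits: list = field(default_factory=list)  # unflushed writes
    hfiles: list = field(default_factory=list)          # files under /hbase/TABLENAME/<region>

@dataclass
class TableState:
    table_name: str
    schema: dict    # metadata kept in the .META. catalog table
    regions: list   # RegionState for every region, across all regionservers

def snapshot_inputs(state: TableState) -> dict:
    """Collect everything a consistent snapshot must capture."""
    files = [f for r in state.regions for f in r.hfiles]
    unflushed = [e for r in state.regions for e in r.memstore_edits]
    return {"schema": state.schema, "hfiles": files, "memstore": unflushed}
```

The point of the sketch: the HFiles alone are not enough; the unflushed memstore content and the catalog metadata are part of the table's state too.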
In the above, there is talk of 'offlining' the table or flipping it
'read-only'. Realistically, we cannot require that a production hbase be
offlined, or even flipped read-only, to make a snapshot.
So, snapshotting needs to do something like Clint Morgan describes above: we
don't depend on memstore flush; instead we get the memstore content from
WAL files, and snapshotting is triggered by sending a signal across the
cluster, probably best distributed via zookeeper. On receipt, regionservers
roll their WAL and dump a manifest of all the files in their care;
e.g. a list of all regions they are currently serving and the files of which
these are comprised. The regionserver hosting the .META. would also dump the
metadata for said table at this time. During this WAL roll and manifest dump,
we would continue to take on writes, but there might be some transitions we'd
suspend for the short time it would take to write out the manifest, etc., such
as memstore flushing.
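The regionserver side of that protocol can be sketched as below. The signal delivery (zookeeper watch), WAL handling, and manifest format are all hypothetical stand-ins for illustration; none of this is a real HBase API:

```python
# Sketch: on receipt of the cluster-wide snapshot signal, a regionserver
# rolls its WAL and dumps a manifest of the regions and files in its care.
import json

class RegionServer:
    def __init__(self, name: str, regions: dict):
        self.name = name
        self.regions = regions  # {region_name: [hfile, ...]}
        self.wal_seq = 0

    def roll_wal(self) -> str:
        """Close the current WAL file and open a new one (stubbed)."""
        self.wal_seq += 1
        return f"{self.name}/wal.{self.wal_seq}"

    def on_snapshot_signal(self, snapshot_name: str) -> str:
        """Handler for the (zookeeper-distributed) snapshot signal: roll the
        WAL, then write out a manifest naming the closed WAL plus every
        region served here and the files of which each is comprised."""
        closed_wal = self.roll_wal()
        manifest = {
            "snapshot": snapshot_name,
            "server": self.name,
            "wal": closed_wal,
            "regions": self.regions,
        }
        return json.dumps(manifest)
```

Because the rolled WAL is named in the manifest, a later restore can replay it to recover the memstore content without ever having forced a flush.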
So that a snapshot can be saved aside or reconstituted in-situ, regionservers
cannot delete files. For example, a compaction instead moves the old files
aside rather than deleting them when the new compacted file is made. The
moved-aside, non-deleted files might be copied elsewhere by some other process
(a distcp MR job). They might also be cleaned up by a cleaner that could
distinguish old snapshots from the more recent (being careful not to remove
files still in use).
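A minimal sketch of the "move aside rather than delete" rule and the cleaner that goes with it, assuming the cleaner is handed the set of files still referenced by live snapshot manifests (the merge and the file moves are stand-ins):

```python
# Sketch: compaction archives its inputs instead of deleting them, and a
# separate cleaner removes only archived files no snapshot still references.

def compact(region_files: list, archive: list) -> list:
    """Replace old store files with one compacted file; archive the old ones."""
    compacted = "+".join(sorted(region_files))  # stand-in for a real merge
    archive.extend(region_files)                # moved aside, NOT deleted
    return [compacted]

def clean_archive(archive: list, referenced: set) -> list:
    """Keep only archived files some live snapshot manifest still references;
    everything else is safe to delete. Returns the surviving archive."""
    return [f for f in archive if f in referenced]
```

The key invariant: deletion moves out of the compaction path entirely, so a snapshot taken before a compaction still points at files that exist.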
We'd need a script, something like add_table.rb, that could fix up .META. with
the snapshotted .META. state and that moved files around in the filesystem --
from the snapshot location back into the running position.
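The restore step could look roughly like this. Here .META. and the filesystem are faked with dicts, and the /hbase/.snapshot/ layout is an assumption for illustration, not an existing convention:

```python
# Sketch of restore: re-insert the snapshotted region rows into .META.
# and move each file from the snapshot location back into running position.

def restore_snapshot(meta: dict, fs: dict, snapshot: dict) -> None:
    """Fix up the .META. catalog from snapshotted state and move files back."""
    for region, info in snapshot["regions"].items():
        meta[region] = info["metadata"]  # restore the catalog row for this region
        for f in info["files"]:
            src = f"/hbase/.snapshot/{snapshot['name']}/{f}"
            dst = f"/hbase/{snapshot['table']}/{region}/{f}"
            fs[dst] = fs.pop(src)        # "rename" back into the running position
```

Because HDFS renames are cheap metadata operations, restoring in-situ this way need not copy any data.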
> Snapshot of table
> -----------------
>
> Key: HBASE-50
> URL: https://issues.apache.org/jira/browse/HBASE-50
> Project: Hadoop HBase
> Issue Type: New Feature
> Reporter: Billy Pearson
> Priority: Minor
>
> Having an option to take a snapshot of a table would be very useful in
> production.
> What I would like to see this option do is merge all the data into
> one or more files stored in the same folder on the dfs. This way we could
> save the data in case of a software bug in hadoop or user code.
> The other advantage would be being able to export a table to multiple
> locations. Say I had a read_only table that must be online. I could take a
> snapshot of it when needed, export it to a separate data center, and have it
> loaded there; then I would have it online at multiple data centers for load
> balancing and failover.
> I understand that hadoop takes away the need for backups to protect
> against failed servers, but this does not protect us from software bugs that
> might delete or alter data in ways we did not plan. We should have a way to
> roll back a dataset.