[
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286862#comment-13286862
]
Jesse Yates commented on HBASE-6055:
------------------------------------
But before a detailed description of how timestamp-based snapshots work
internally, let's answer some comments!
@Jon: I'll add more info to the document to cover this stuff, but for the
moment, let's just get it out there.
{quote}
What is the read mechanism for snapshots like? Does the snapshot act like a
read-only table or is there some special external mechanism needed to read the
data from a snapshot? You mention having to rebuild in-memory state by
replaying wals – is this a recovery situation or needed in normal reads?
{quote}
It's almost, but not quite, like a table. Reading a snapshot is going to require
an external tool, but after hooking up the snapshot via that tool, it should act
just like a real table.
Snapshots are intended to happen as fast as possible, to minimize downtime for
the table. To enable that, we are just creating reference files in the snapshot
directory. My vision is that once you take a snapshot, at some point (maybe
weekly) you export the snapshot to a backup area. In the export you actually
copy the referenced files - you read the HFiles and the WAL files directly from
HDFS (avoiding the top-level interface). Then, when you want to read the
snapshot, you can just bulk-import the HFiles and replay the WAL files (with the
WALPlayer this is relatively easy) to rebuild the state of the table at the time
of the snapshot. It's not an exact copy (META isn't preserved), but all the
actual data is there.
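To make the restore side concrete, here's a rough sketch of what that
bulk-import plus WAL replay could look like if driven from Java (the paths, the
table name, and the assumption that the target table already exists are just for
the example - the real tooling may end up looking different):
{code:java}
// Sketch only: restore an exported snapshot by bulk-loading the copied HFiles,
// then replaying the exported WALs with WALPlayer. All paths and the target
// table name are hypothetical; the target table is assumed to already exist
// with the right column families.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.mapreduce.WALPlayer;
import org.apache.hadoop.util.ToolRunner;

public class RestoreSnapshotSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // 1. Bulk-load the HFiles that were copied out during the export.
    ToolRunner.run(conf, new LoadIncrementalHFiles(conf),
        new String[] { "/backup/tuesday-at-nine/stuff", "stuff" });

    // 2. Replay the exported WAL files on top of the bulk-loaded data to pick
    //    up the edits that were only in the memstores at snapshot time.
    ToolRunner.run(conf, new WALPlayer(),
        new String[] { "/backup/tuesday-at-nine/.logs", "stuff" });
  }
}
{code}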
The caveat here is that, since everything is references, one of the WAL files
you reference may not actually have been closed (and therefore not readable). In
the common case this won't happen, but if you snapshot and immediately export,
it's possible. In that case, you need to roll the WALs for the RSs that haven't
rolled them yet. However, this is in the export process, so a little latency
there is tolerable, whereas avoiding this means adding latency to taking a
snapshot - bad news bears.
Keep in mind that the log files and HFiles get regularly cleaned up. The former
are moved to the .oldlogs directory and periodically cleaned up, and the latter
get moved to the .archive directory (again with a parallel file hierarchy, as
per HBASE-5547). If the snapshot reader follows a reference file down to the
original file and doesn't find it, then it will need to look up the same file in
its respective archive directory. If it's not there, then you are really hosed
(except for the case mentioned in the doc about the WALs getting cleaned up by
an aggressive log cleaner, which, as shown there, is not a problem).
I haven't gotten around to implementing this yet, but it seems reasonable to
finish up (and I think Matteo was interested in working on that part).
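To make that lookup concrete, here's a minimal sketch of the "original location,
then archive location" resolution, assuming a parallel hierarchy under
/hbase/.archive (the helper and exact paths are illustrative, not the actual
implementation):
{code:java}
// Sketch only: resolve the file named by a ".reference", falling back to the
// archive directory when the original has already been cleaned up. The
// parallel-hierarchy assumption follows HBASE-5547; the method itself is
// hypothetical.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReferenceResolverSketch {
  /**
   * @param original    the file the reference points at,
   *                    e.g. /hbase/stuff/region1/column/region1-hfile-1
   * @param archiveRoot the parallel archive hierarchy, e.g. /hbase/.archive
   * @param hbaseRoot   the HBase root dir, e.g. /hbase
   */
  public static Path resolve(FileSystem fs, Path original, Path archiveRoot,
      Path hbaseRoot) throws IOException {
    if (fs.exists(original)) {
      return original;
    }
    // Rebuild the same table/region/family/file path under the archive root.
    String relative = original.toUri().getPath()
        .substring(hbaseRoot.toUri().getPath().length());
    Path archived = new Path(archiveRoot, relative.replaceFirst("^/", ""));
    if (fs.exists(archived)) {
      return archived;
    }
    throw new IOException("Referenced file " + original
        + " not found in the original or archive location");
  }
}
{code}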
{quote}
What does a representation of a snapshot look like in terms of META and file
system contents?
{quote}
The way I see it, the implementation in the end is just a bunch of files in the
/hbase/.snapshot directory. Like I mentioned above, the layout is very similar
to the layout of a table.
Let's look at an example of a table named "stuff" (snapshot names need to be
valid directory names - same as a table or CF) that has a column family
"column" and is hosted on servers rs-1 and rs-2. Originally, the file system
will look something like this (with some license taken on file names - it's not
exact, I know, this is just an example):
/hbase/
  .logs/
    rs-1/
      WAL-rs1-1
      WAL-rs1-2
    rs-2/
      WAL-rs2-1
      WAL-rs2-2
  stuff/
    .tableinfo
    region1/
      column/
        region1-hfile-1
    region2/
      column/
        region2-hfile-1
The snapshot named "tuesday-at-nine", when completed, then just adds the
following to the directory structure (or close enough):
.snapshot/
  tuesday-at-nine/
    .tableinfo
    .snapshotinfo
    .logs/
      rs-1/
        WAL-rs1-1.reference
        WAL-rs1-2.reference
      rs-2/
        WAL-rs2-1.reference
        WAL-rs2-2.reference
    stuff/
      .tableinfo
      region1/
        column/
          region1-hfile-1.reference
      region2/
        column/
          region2-hfile-1.reference
The only file here that isn't a reference is the tableinfo, since it is a pretty
small file (generally), so a copy seemed more prudent than doing archiving on
changes to the table info.
The original implementation updated META with file references to do HBase-level
hard links for the HFiles. After getting the original implementation working,
I'm going to be ripping this piece out in favor of just doing an HFile cleaner
and cleaner delegates (similar to logs) and then have a snapshot cleaner that
reads off the FS for file references.
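As a rough illustration of what such a snapshot-aware cleaner could check (the
class name and the hook are made up; only the "scan the .snapshot tree for
.reference files" idea is from above):
{code:java}
// Hypothetical sketch: before the cleaner chore deletes an archived HFile,
// walk /hbase/.snapshot and refuse the delete if any snapshot still holds a
// matching ".reference" file for it.
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotHFileCleanerSketch {

  /** Returns true only if no snapshot holds a reference to the given HFile. */
  public static boolean isFileDeletable(FileSystem fs, Path snapshotRoot, Path hfile)
      throws IOException {
    if (!fs.exists(snapshotRoot)) return true;
    return !hasReference(fs, snapshotRoot, hfile.getName() + ".reference");
  }

  private static boolean hasReference(FileSystem fs, Path dir, String refName)
      throws IOException {
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDir()) {
        if (hasReference(fs, status.getPath(), refName)) return true;
      } else if (status.getPath().getName().equals(refName)) {
        return true;
      }
    }
    return false;
  }
}
{code}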
{quote}
At some point we may get called upon to repair these, I want to make sure there
are enough breadcrumbs for this to be possible.
{quote}
How could that happen - hbase never has problems! (sarcasm)
{quote}
- hlog roll (which I believe does not trigger a flush) instead of special meta
hlog marker (this might avoid write unavailability, seems simpler than the
mechanism I suggested)
{quote}
The hlog marker is what I'm planning on doing for the timestamp-based snapshot,
which is going to be far safer than doing an HLog roll and will add less
latency. With the roll, you can't take any writes to the memstore between the
roll and the end of the snapshot (otherwise you will lose edits). Writing meta
edits into the HLog lets you keep taking edits and not worry about it.
{quote}
admin initiated snapshot and admin initiated restore operations as opposed to
acting like a read only table. (not sure what happens to "newer" data after a
restore, need to reread to see if it is in there, not sure about the cost to
restore a snapshot)
{quote}
Yup, right now it's all handled from HBaseAdmin. Matteo was interested in
working on the restore stuff, but depending on timing, I may end up picking up
that work once I get the taking of a snapshot working. I think part of
"snapshots" definitely includes getting back the state.
{quote}
I believe it also has an ability to read files directly from an MR job without
having to go through HBase's get/put interface. Is that in scope for HBASE-6055?
{quote}
Absolutely in scope. It just didn't come up because I considered that part of
the restore (which Matteo expressed interest in). If you had to go through the
high-level interface, then you would just use the procedure Lars talks about in
his blog:
http://hadoop-hbase.blogspot.com/2012/04/timestamp-consistent-backups-in-hbase.html
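For reference, going through the high-level interface basically boils down to a
time-bounded scan. Here's a minimal sketch, reusing the table/family names from
the example above and assuming the backup timestamp is agreed on out of band:
{code:java}
// Sketch of a "timestamp consistent" copy via the client API: only cells
// written at or before an agreed-upon backup timestamp are read. Where the
// rows get written to is left out.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampBoundedCopySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    long backupTs = Long.parseLong(args[0]); // agreed-upon backup timestamp

    HTable table = new HTable(conf, "stuff");
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("column"));
    // maxStamp is exclusive, so +1 includes cells written exactly at backupTs
    scan.setTimeRange(0, backupTs + 1);

    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      // write the row out to the backup location (elided)
    }
    scanner.close();
    table.close();
  }
}
{code}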
The other notable change is that I'm building to support multiple snapshots
concurrently. It's really a trivial change, so I don't think it's too much
feature creep - just a matter of using lists rather than a single item.
{quote}
How does this buy you more consistency? Aren't we still inconsistent at the
prepare point now instead? Can we just write the special snapshotting hlog
entry at initiation of prepare, allowing writes to continue, then adding data
elsewhere (META) to mark success in commit? We could then have some
compaction/flush time logic clean up failed attempt markers?
{quote}
See the above comment about timestamp-based vs. point-in-time and the former
being all that's necessary for HBase. This means we don't take downtime and end
up with a snapshot that is 'fuzzy' in terms of global consistency, but exact in
terms of HBase-delivered timestamps.
The problem point-in-time snapshots have to overcome is reaching distributed
consensus while still trying to maintain availability and partition tolerance.
Since no one has figured out how to beat CAP and we are looking for consistency,
we have to remove some availability to reach consensus. In this case, the
agreement is over the state _of the entire table_, rather than per region
server.
Yes, this is strictly against the contract that we have on a Scan, but it is
also in line with the expectations people have of what a snapshot means. Any
writes that are pending before the snapshot are allowed to commit, but any
writes that reach the RS after the snapshot time cannot be included in the
snapshot. I got a little overzealous in my reading of HBASE-50 and took it to
mean global state, but after review the only way it would work within the
constraints (no downtime) is to make it timestamp-based.
But why can't we get global consistency without taking downtime?
Let's take your example of using an HLog edit to mark the start (and for ease,
let's say the end as well - as long as it's durable and recoverable, it doesn't
matter if it's WAL or META).
Say we start a snapshot and send a message to all the RSs (let's ignore ZK for
the moment, to simplify things) that they should take a snapshot. So they write
a marker into the HLog marking the start, create references as mentioned above,
and then report to the master that they are done. When everyone is done, we
then message each RS to commit the snapshot, which is just another entry into
the WAL. Then, in rebuilding the snapshot, they would just replay the WAL up to
the start marker (assuming the end marker is found).
How do we know, though, which writes arrived first on each RS if we just dump an
edit into the WAL? OK, so then we need to wait for the MVCC read number to roll
forward to when we got the snapshot notification _before_ we can write the edit
to the log - totally reasonable.
However, the problem arises in attempting to get a global state of the table in
a high-volume write environment. We have no guarantee that the "snapshot
commit" notification reached each of the RSs at the same time. And even if it
did reach them at the same time, maybe there was some latency in getting the
write number. Or the switch was a little wonky, or the RS was just finishing up
a GC (I could go on).
Then we have a case where we don't actually have the snapshot as of the commit,
but rather "at commit, plus or minus a bit" - not a clean snapshot. (If we don't
care about being exact, then we can do a much faster, lower-latency solution,
the discussion of which is still coming, I promise.) In a system that can take
millions of writes a second, that is still a non-trivial amount of data that can
change in a few milliseconds - no longer a true 'point in time'. The only way to
get that global, consistent view is to remove the availability of the table for
a short time so we know that the state is the same across all the region
servers.
Say we start a snapshot; the start indication isn't going to reach the servers
and get acted on at _exactly the same time on all the servers_, which, as
explained above, is the likely case. Then we let the servers commit any
outstanding writes, but they don't get to take any new writes for a short time.
In this time, while they are waiting for writes to commit, we can do all the
snapshot preparation (referencing, table info copying). Once we are ready for
the snapshot, we report back to the master and wait for the commit step. In this
time we are still not taking writes. The key here is that for that short time,
none of the servers are taking writes, and that allows us to get a single point
in time at which no writes are committing (they do get buffered on the server,
they just can't change the system state).
If we let writes commit, then how do we reach a state that we can agree on
across all the servers? You again don't have any assurances that the prepare or
the commit message time is agreed on by all the servers. The table-level
consistent state is somewhere between the prepare and the commit, but it's not
clear how one would find that point - I'm pretty sure we can't do this unless we
have perfectly synchronized clocks, which is not really possible without a
better understanding of quantum mechanics :)
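To make the "stop the world" flow above concrete, here's a pseudocode-style
sketch of the region server's side of it (every type and method name here is
made up; only the ordering of the steps comes from the discussion above):
{code:java}
// Pseudocode-as-Java: the blocking two-phase shape of a globally consistent
// snapshot. Nothing here is the actual patch.
public class GlobalSnapshotParticipantSketch {

  interface SnapshotRegion {
    void blockNewWrites();            // hold incoming writes at the door
    void waitForOutstandingWrites();  // let already-accepted writes commit
    void createReferenceFiles(String snapshotName); // layout as shown earlier
    void unblockWrites();
  }

  interface SnapshotCoordinator {
    void reportPrepared(String snapshotName); // tell the master this RS is ready
  }

  /** Phase 1: runs on each RS when the master announces the snapshot. */
  void prepare(SnapshotRegion region, SnapshotCoordinator master, String snapshotName) {
    region.blockNewWrites();
    region.waitForOutstandingWrites();
    region.createReferenceFiles(snapshotName);
    master.reportPrepared(snapshotName);
    // ...still not taking writes while we wait for the global commit...
  }

  /** Phase 2: only runs after *every* RS has reported prepared. */
  void commit(SnapshotRegion region) {
    region.unblockWrites(); // buffered writes drain; the table is fully available again
  }
}
{code}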
'Blocking writes' is perhaps a bad phrase in this situation. In the current
implementation, writes are buffered as blocked threads in the server, waiting on
the updateLock. However, we can go with a "semi-blocking" version: writes still
complete, but they aren't going to be visible until we roll forward to the
snapshot MVCC number. This lets the writers complete (not affecting write
latency), but is going to affect read-modify-write and reader-to-writer
comparison latency. However, as soon as we roll forward the MVCC, all those
writes become visible, essentially catching back up to the current state. A
slight modification to the WAL edits will need to be made to write the MVCC
number so we can keep track of which writes are in/out of a snapshot, but that
_shouldn't_ be too hard (famous last words). You don't even need to modify all
the WAL edits, just those made during the snapshot window, so the over-the-wire
cost stays essentially the same when amortized over the life of a table (for the
standard use case).
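And a similarly hypothetical sketch of that semi-blocking idea: writes complete
and carry their MVCC number (which would also go into the WAL entry), but don't
become visible until the read point rolls past the snapshot point. All names are
invented; the real MVCC plumbing in the region server is more involved:
{code:java}
// Hypothetical sketch of semi-blocking snapshot visibility: during the
// snapshot window, completed writes are held back from readers until the
// read point is rolled forward at snapshot commit.
public class SemiBlockingSnapshotSketch {
  private long nextWriteNumber = 0;
  private volatile long readPoint = 0; // highest write number visible to readers
  private long snapshotPoint = -1;     // -1 means no snapshot in progress

  /** Snapshot start: everything visible up to here is "in" the snapshot. */
  public synchronized void startSnapshot() {
    snapshotPoint = readPoint;
  }

  /**
   * A write completes: it is durable (its number would also be written into
   * the WAL entry), but it only becomes visible immediately when no snapshot
   * is in progress.
   */
  public synchronized long completeWrite() {
    long mvcc = ++nextWriteNumber;
    if (snapshotPoint < 0) {
      readPoint = mvcc;
    }
    return mvcc;
  }

  /** Snapshot commit: roll the read point forward so held-back writes show up. */
  public synchronized void finishSnapshot() {
    snapshotPoint = -1;
    readPoint = nextWriteNumber;
  }

  /** Readers only see cells whose write number is at or below the read point. */
  public boolean isVisible(long cellWriteNumber) {
    return cellWriteNumber <= readPoint;
  }
}
{code}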
I'm looking at doing this once I get the simple version working - one step at a
time. Moving to the timestamp-based approach lets us keep taking writes, but
does so at the cost of global consistency in favor of local consistency, and
still uses the _exact same infrastructure_. The first patch I'll actually put on
RB will be the timestamp-based one, but let me get the stop-the-world version
going before going down a rabbit hole.
The only thing we don't capture is the case where a writer makes a request to
the RS before the snapshot is taken (by another client), but the write doesn't
reach the server until after the RS hits the start barrier. From the global
client perspective, this write should be in the snapshot, but that requires a
single client or client-side write coordination (via a timestamp oracle).
However, this is even worse coordination and creates even more constraints on a
system where we currently have no coordination between clients (and I'm against
adding any). So yes, we miss that edit, but that would be the case in a
single-server database anyways without an external timestamp manager (which
again means distributed coordination between the client and server, though it
can be done in a non-blocking manner). I'll mention some of this external
coordination in the timestamp explanation.
> Snapshots in HBase 0.96
> -----------------------
>
> Key: HBASE-6055
> URL: https://issues.apache.org/jira/browse/HBASE-6055
> Project: HBase
> Issue Type: New Feature
> Components: client, master, regionserver, zookeeper
> Reporter: Jesse Yates
> Assignee: Jesse Yates
> Fix For: 0.96.0
>
> Attachments: Snapshots in HBase.docx
>
>
> Continuation of HBASE-50 for the current trunk. Since the implementation has
> drastically changed, opening as a new ticket.