[ 
https://issues.apache.org/jira/browse/HBASE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286862#comment-13286862
 ] 

Jonathan Hsieh edited comment on HBASE-6055 at 6/1/12 11:02 PM:
----------------------------------------------------------------

_(jon: I made a minor formatting tweak to make the dir structure easier to 
read)_

But before a detailed description of how timestamp-based snapshots work 
internally, let's answer some comments!

@Jon: I'll add more info to the document to cover this stuff, but for the 
moment, let's just get it out there.

{quote}
What is the read mechanism for snapshots like? Does the snapshot act like a 
read-only table or is there some special external mechanism needed to read the 
data from a snapshot? You mention having to rebuild in-memory state by 
replaying wals – is this a recovery situation or needed in normal reads?
{quote}

It's almost, but not quite, like a table. Reading a snapshot is going to 
require an external tool, but after hooking the snapshot up via that tool, it 
should act just like a real table.

Snapshots are intended to happen as fast as possible, to minimize downtime for 
the table. To enable that, we are just creating reference files in the snapshot 
directory. My vision is that once you take a snapshot, at some point (maybe 
weekly) you export the snapshot to a backup area. In the export you actually 
copy the referenced files - a direct read of the HFiles (avoiding the top-level 
interface and going right to HDFS) and the WAL files. Then when you want to 
read the snapshot, you can just bulk-import the HFiles and replay the WAL files 
(with the WALPlayer this is relatively easy) to rebuild the state of the table 
at the time of the snapshot. It's not an exact 
copy (META isn't preserved), but all the actual data is there.
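
To make that restore path concrete, here is a minimal sketch, assuming the 
export copied the snapshot's HFiles and WALs to a hypothetical 
/backup/my-snapshot directory and that we rebuild into a table named 
"restored"; the exact arguments for these tools may vary by version:

{code}
// Sketch: rebuild a table from an exported snapshot by bulk-loading the copied
// HFiles and then replaying the copied WALs. Paths/table names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.mapreduce.WALPlayer;
import org.apache.hadoop.util.ToolRunner;

public class RestoreExportedSnapshot {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // 1. Bulk-import the HFiles that were copied out during the export.
    ToolRunner.run(conf, new LoadIncrementalHFiles(conf),
        new String[] { "/backup/my-snapshot/hfiles", "restored" });

    // 2. Replay the copied WALs into the restored table to pick up edits that
    //    were only in the memstore (not yet in an HFile) at snapshot time.
    ToolRunner.run(conf, new WALPlayer(),
        new String[] { "/backup/my-snapshot/wals", "source-table", "restored" });
  }
}
{code}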

The caveat here is that since everything is a reference, one of the WAL files 
you reference may not actually have been closed (and therefore not be 
readable). In the common case this won't happen, but if you snap and 
immediately export, it's possible. In that case, you need to roll the WAL for 
the RSs that haven't rolled theirs yet. However, this is in the export process, 
so a little latency there is tolerable, whereas avoiding this means adding 
latency to taking a snapshot - bad news bears.

Keep in mind that the log files and hfiles will get regularly cleaned up. The 
former get moved to the .oldlogs directory and periodically cleaned up, and the 
latter get moved to the .archive directory (again with a parallel file 
hierarchy, as per HBASE-5547). If the snapshot reader follows a reference file 
down to the original file and doesn't find it, then it needs to look up the 
same file in its respective archive directory. If it's not there either, then 
you are really hosed (except for the case mentioned in the doc about the WALs 
getting cleaned up by an aggressive log cleaner, which, as shown there, is not 
a problem).
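
As a rough illustration of that lookup order - a hypothetical helper, since the 
reference-file format and the exact archive layout aren't final - resolution 
would look something like this:

{code}
// Hypothetical: resolve a snapshot reference to a readable file by trying the
// original location first and then the parallel path under .archive.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReferenceResolver {
  public static Path resolve(FileSystem fs, Path hbaseRoot, Path original)
      throws IOException {
    if (fs.exists(original)) {
      return original;  // common case: the file hasn't been archived yet
    }
    // Same table/region/family/file hierarchy, just rooted under .archive
    Path archived = new Path(new Path(hbaseRoot, ".archive"),
        relativize(hbaseRoot, original));
    if (fs.exists(archived)) {
      return archived;
    }
    throw new IOException("Referenced file missing from both the original "
        + "location and the archive: " + original);
  }

  private static String relativize(Path root, Path file) {
    // Strip the hbase root prefix (plus the separator) from the file's path
    return file.toUri().getPath().substring(root.toUri().getPath().length() + 1);
  }
}
{code}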

I haven't gotten around to implementing this yet, but it seems a reasonable 
piece to finish up (and I think Matteo was interested in working on that part).

{quote}
What is a representation of a snapshot look like in terms of META and file 
system contents?
{quote}

In the end I see the implementation as just a bunch of files in the 
/hbase/.snapshot directory. As I mentioned above, the layout is very similar 
to the layout of a table.

Let's look at an example: a table named "stuff" (snapshot names need to be 
valid directory names - same rules as a table or CF) with a column family 
"column", hosted on servers rs-1 and rs-2. Originally, the file system will 
look something like this (with some license taken on file names - it's not 
exact, I know, this is just an example):
{code}
/hbase/
        .logs/
                rs-1/
                        WAL-rs1-1
                        WAL-rs1-2
                rs-2/
                        WAL-rs2-1
                        WAL-rs2-2
        stuff/
                .tableinfo
                region1
                        column
                                region1-hfile-1
                region2
                        column
                                region2-hfile-1
{code}

The snapshot named "tuesday-at-nine", when completed, then just adds the 
following to the directory structure (or close enough):

{code}
        .snapshot/
                tuesday-at-nine/
                        .tableinfo
                        .snapshotinfo
                        .logs/
                                rs-1/
                                        WAL-rs1-1.reference
                                        WAL-rs1-2.reference
                                rs-2/
                                        WAL-rs2-1.reference
                                        WAL-rs2-2.reference
                        stuff/
                                .tableinfo
                                region1
                                        column
                                                region1-hfile-1.reference
                                region2
                                        column
                                                region2-hfile-1.reference
{code}

The only file here that isn't a reference is the .tableinfo; since it is 
generally a pretty small file, a copy seemed more prudent than doing archiving 
on changes to the table info.

The original implementation updated META with file references to do hbase-level 
hard links for the HFiles. After getting the original implementation working, 
I'm going to rip this piece out in favor of just doing an HFile cleaner with 
cleaner delegates (similar to logs) and then have a snapshot cleaner that 
reads the FS for file references.
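
The shape of that snapshot cleaner check would be roughly the following - a 
sketch only, mirroring the spirit of the existing log-cleaner delegates rather 
than any real interface, and assuming references live under /hbase/.snapshot 
with a .reference suffix:

{code}
// Hypothetical snapshot-aware cleaner check: an archived HFile may only be
// deleted if no snapshot under .snapshot still holds a reference to it.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotAwareHFileCleaner {
  private final FileSystem fs;
  private final Path snapshotDir; // e.g. /hbase/.snapshot

  public SnapshotAwareHFileCleaner(FileSystem fs, Path snapshotDir) {
    this.fs = fs;
    this.snapshotDir = snapshotDir;
  }

  /** @return true if no snapshot references this file, i.e. it is safe to delete. */
  public boolean isDeletable(FileStatus file) throws IOException {
    Set<String> referenced = new HashSet<String>();
    collectReferencedNames(snapshotDir, referenced);
    return !referenced.contains(file.getPath().getName());
  }

  /** Walk the snapshot directory, recording the base name of every .reference file. */
  private void collectReferencedNames(Path dir, Set<String> names) throws IOException {
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDir()) {
        collectReferencedNames(status.getPath(), names);
      } else if (status.getPath().getName().endsWith(".reference")) {
        String name = status.getPath().getName();
        names.add(name.substring(0, name.length() - ".reference".length()));
      }
    }
  }
}
{code}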

{quote}
At some point we may get called upon to repair these, I want to make sure there 
are enough breadcrumbs for this to be possible.
{quote}

How could that happen - HBase never has problems! (sarcasm)

{quote}
 - hlog roll (which I believe does not trigger a flush) instead of special meta 
hlog marker (this might avoid write unavailability, seems simpler that the 
mechanism I suggested)
{quote}

The hlog marker is what I'm planning on doing for the timestamp-based snapshot, 
which is going to be far safer than doing an HLog roll and add less latency. 
With the roll, you need to not take any writes to the memstore between the roll 
and the end of the snapshot (otherwise you will lose edits). Writing meta edits 
into the HLog lets you keep taking edits and not worry about it.

{quote}
admin initiated snapshot and admin initiated restore operations as opposed to 
acting like a read only table. (not sure what happens to "newer" data after a 
restore, need to reread to see if it is in there, not sure about the cost to 
restore a snapshot)
{quote}

Yup, right now it's all handled from HBaseAdmin. Matteo was interested in 
working on the restore stuff, but depending on timing, I may end up picking up 
that work once I get the taking of a snapshot working. I think part of 
"snapshots" definitely includes getting the state back.

{quote}
I believe it also has an ability to read files directly from an MR job without 
having to go through HBase's get/put interface. Is that in scope for HBASE-6055?
{quote}

Absolutely in scope. It just didn't come up because I considered that part of 
the restore (which Matteo expressed interest in). If you had to go through the 
high-level interface, then you would just use the procedure Lars talks about in 
his blog: 
http://hadoop-hbase.blogspot.com/2012/04/timestamp-consistent-backups-in-hbase.html

The other notable change is that I'm building to support multiple snapshots 
concurrently. It's really a trivial change, so I don't think it's too much 
feature creep - just a matter of using lists rather than a single item.

{quote}
How does this buy your more consistency? Aren't we still inconsistent at the 
prepare point now instead? Can we just write the special snapshotting hlog 
entry at initiation of prepare, allowing writes to continue, then adding data 
elsewhere (META) to mark success in commit? We could then have some 
compaction/flush time logic cleanup failed atttempt markers?
{quote}

See the above comment about timestamp-based vs. point-in-time and the former 
being all that's necessary for HBase. This means we don't take downtime; we end 
up with a snapshot that is 'fuzzy' in terms of global consistency, but exact in 
terms of HBase-delivered timestamps.

The problem point-in-time snapshots have to overcome is reaching distributed 
consensus while still trying to maintain availability and tolerate partitions. 
Since no one has beaten CAP and we are looking for consistency, we have to give 
up some availability to reach consensus. In this case, the agreement is over 
the state _of the entire table_, rather than per region server.

Yes, this is strictly against the contract that we have on a Scan, but it is 
also in line with the expectations people have of what a snapshot means. Any 
writes that are pending before the snapshot are allowed to commit, but any 
writes that reach the RS after the snapshot time cannot be included in the 
snapshot. I got a little overzealous in my reading of HBASE-50 and took it to 
mean global state, but after review the only way it works within the 
constraints (no downtime) is to make it timestamp-based.

But why can't we get global consistency without taking downtime?

Let's take your example of using an HLog edit to mark the start (and for ease, 
let's say the end as well - as long as it's durable and recoverable, it doesn't 
matter if it's the WAL or META).

Say we start a snapshot and send a message to all the RSs (let's ignore ZK for 
the moment, to simplify things) that they should take a snapshot. So they write 
a marker into the HLog marking the start, create references as mentioned above, 
and then report to the master that they are done. When everyone is done, we 
then message each RS to commit the snapshot, which is just another entry in the 
WAL. Then, in rebuilding the snapshot, they would just replay the WAL up to the 
start marker (assuming the end marker is found).

But how do we know which writes arrived first on each RS if we just dump a 
write into the WAL? OK, so then we need to wait for the MVCC read number to 
roll forward to the point where we got the snapshot notification _before_ we 
can write an edit to the log - totally reasonable.

However, the problem arises in attempting to get a global state of the table in 
a high-volume write environment. We have no guarantee that the "snapshot 
commit" notification reached each of the RSs at the same time. And even if it 
did reach them at the same time, maybe there was some latency in getting the 
write number. Or the switch was a little wonky, or the server was just 
finishing up a GC (I could go on).

Then we have a case where we don't actually have the snapshot as of the commit, 
but rather "at commit, plus or minus a bit" - not a clean snapshot. (If we 
don't care about being exact, then we can do a much faster, lower-latency 
solution, the discussion of which is still coming, I promise.) In a system that 
can take millions of writes a second, that is still a non-trivial amount of 
data that can change in a few milliseconds - no longer a true 'point in time'.

The only way to get that global, consistent view is to remove the availability 
of the table for a short time, so we know that the state is the same across all 
the servers.

Say we start a snapshot and the start indication doesn't reach the servers and 
get acted on at _exactly the same time on all the servers_ - which, as 
explained above, is very likely. Then we let the servers commit any outstanding 
writes, but they don't get to take any new writes for a short time. In this 
time, while they are waiting for writes to commit, we can do all the snapshot 
preparation (referencing, table info copying). Once we are ready for the 
snapshot, we report back to the master and wait for the commit step. In this 
time we are still not taking writes. The key here is that for that short time, 
none of the servers are taking writes, and that allows us to get a single point 
in time at which no writes are committing (they do get buffered on the server, 
they just can't change the system state).
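
Laid out as pseudo-code on a single region server - purely a sketch to make the 
ordering explicit; none of these class or method names are real HBase APIs - 
the stop-the-world flow is:

{code}
// Sketch of the "stop the world" snapshot flow on one region server.
// All names here are hypothetical; only the ordering matters.
public class PointInTimeSnapshotSketch {
  interface RegionServerOps {
    void blockNewWrites();           // stop letting new writes change state (they buffer)
    void waitForOutstandingWrites(); // let in-flight writes finish committing
    void createReferenceFiles();     // cheap: HFile/WAL references under .snapshot
    void reportPreparedToMaster();
    void awaitCommitFromMaster();    // global barrier: released once every RS is prepared
    void unblockWrites();            // buffered writes proceed
  }

  static void snapshotOnRegionServer(RegionServerOps rs) {
    rs.blockNewWrites();
    rs.waitForOutstandingWrites();   // this RS's state is now frozen
    rs.createReferenceFiles();
    rs.reportPreparedToMaster();
    rs.awaitCommitFromMaster();      // while waiting, no RS is taking writes =>
                                     // a single, table-wide point in time exists
    rs.unblockWrites();
  }
}
{code}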

If we let writes commit, then how do we reach a state that we can agree on 
across all the servers? If you let the writes commit, you again don't have any 
assurances that the prepare or the commit message time is agreed to by all the 
servers. The table-level consistent state is somewhere between the prepare and 
commit, but it's not clear how one would find that point - I'm pretty sure we 
can't do this unless we have perfectly synchronized clocks, which is not really 
possible without a better understanding of quantum mechanics :)

"Block writes" is perhaps a bad phrase in this situation. In the current 
implementation, the writing threads are buffered in the server, blocking on the 
updateLock. However, we can go with a "semi-blocking" version: writes still 
complete, but they aren't going to be visible until we roll forward to the 
snapshot MVCC number. This lets the writers complete (not affecting write 
latency), but it will affect read-modify-write and reader-to-writer comparison 
latency. However, as soon as we roll forward the MVCC, all those writes become 
visible, essentially catching back up to the current state. A slight 
modification to the WAL edits will need to be made to write the MVCC number so 
we can keep track of which writes are in/out of a snapshot, but that 
_shouldn't_ be too hard (famous last words). You don't even need to modify all 
the WAL edits, just those made during the snapshot window, so the over-the-wire 
cost is kept essentially the same when amortized over the life of a 
table (for the standard use case).
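
To make the semi-blocking idea concrete, here is a toy sketch of the visibility 
bookkeeping (a conceptual model only, not HBase's actual MVCC classes): the 
read point is pinned at the snapshot's MVCC number while the snapshot is open, 
writes complete but stay invisible, and the commit rolls the read point forward 
so everything buffered appears at once.

{code}
// Toy model of semi-blocking snapshot visibility; not real HBase MVCC code.
public class SemiBlockingMvccSketch {
  private long readPoint = 0;       // highest write number visible to readers
  private long nextWrite = 1;       // next MVCC write number to hand out
  private boolean snapshotOpen = false;

  /** Pin the read point: everything <= readPoint is "in" the snapshot. */
  synchronized long startSnapshot() {
    snapshotOpen = true;
    return readPoint;               // the snapshot's MVCC number
  }

  /** A write completes immediately; visibility depends on the read point. */
  synchronized long write() {
    long num = nextWrite++;
    if (!snapshotOpen) {
      readPoint = num;              // normal operation: visible right away
    }                               // during a snapshot: buffered, read point stays pinned
    return num;                     // record this number alongside the WAL edit
  }

  /** Roll the read point forward; all buffered writes become visible at once. */
  synchronized void commitSnapshot() {
    snapshotOpen = false;
    readPoint = nextWrite - 1;
  }

  synchronized boolean isVisible(long writeNumber) {
    return writeNumber <= readPoint;
  }
}
{code}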

I'm looking at doing this once I get the simple version working - one step at a 
time. Moving to the timestamp-based approach lets us keep taking writes, but 
does so by trading global consistency for local consistency, and it still uses 
the _exact same infrastructure_. The first patch I'll actually put on RB will 
be the timestamp-based one, but let me get the stop-the-world version going 
before going down a rabbit hole.

The only thing we don't capture is the case where a writer makes a request to 
the RS before the snapshot is taken (by another client), but the write doesn't 
reach the server until after the RS hits the start barrier. From the global 
client perspective, this write should be in the snapshot, but capturing it 
requires a single client or client-side write coordination (via a timestamp 
oracle). However, that is even worse coordination and creates even more 
constraints on a system where we currently have no coordination between clients 
(and I'm against adding any). So yes, we miss that edit, but that would be the 
case in a single-server database anyway without an external timestamp manager 
(again, distributed coordination between the client and server, though it can 
be done in a non-blocking manner). I'll mention some of this external 
coordination in the timestamp explanation.
                
> Snapshots in HBase 0.96
> -----------------------
>
>                 Key: HBASE-6055
>                 URL: https://issues.apache.org/jira/browse/HBASE-6055
>             Project: HBase
>          Issue Type: New Feature
>          Components: client, master, regionserver, zookeeper
>            Reporter: Jesse Yates
>            Assignee: Jesse Yates
>             Fix For: 0.96.0
>
>         Attachments: Snapshots in HBase.docx
>
>
> Continuation of HBASE-50 for the current trunk. Since the implementation has 
> drastically changed, opening as a new ticket.
