[ 
https://issues.apache.org/jira/browse/HBASE-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290554#comment-13290554
 ] 

Jesse Yates commented on HBASE-6180:
------------------------------------

Here is what I've been thinking about for doing timestamp based snapshotting, 
as an extension to the work I've been doing for HBASE-6055.

Timestamp based snapshots are a zero-downtime/non-blocking versions of taking a 
snapshot across a table in HBase. They should be considered 'fuzzy' because you 
don't get a global view, but only as close to globally consistent as we can get 
with timestamps on the region servers (fuzziness is in the NTP different 
between RS, which defaults to max skew of 60 sec). I'm going to mingle a bit of 
theory and implementation here, but feel free to ask questions for things that 
don't make sense.

All the infrastructure from point-in-time snapshots (HBASE-50) is still going 
to be used here: SnapshotManager on the Master, the RegionSnapshotHandler on 
the RS, etc. The only change is what actually happens on each of the regions 
when taking the snapshot and how the snapshot is managed on the Master. Also, 
on a lower level, the time constraints are much looser on taking the snapshot.

Lets walk throughout some of the changes to the actual implementation. 

>From a high-level, we still tell all the RS to start the snapshot. They will 
>then dump a meta edit into the WAL with the memstore timestamp (not clear if 
>this is necessary, but could be useful for completing snapshots on failed RS). 
>They will then post back to ZK that they are starting the snapshot. Each RS 
>can then go about their business adding references to all the files on the FS 
>for the Regions involved in the snapshot. It gets a little tricky when we try 
>to capture the in-memory state of each RS.

The key here though is that we can use the Memstore's built-in snapshot 
functionality to avoid doing any work with the WALs and just keep track of 
HFiles. When flushing the Memstore takes a "snapshot" by just blocking for a 
moment to switch two pointers between the current and the new memstore. All 
writes before the switch go into the old memstore. All new writes go into the 
new memstore. The old memstore can then asynchronously be flushed to disk and 
on scan we just merge in the results from the old version as well as the on 
disk files. The benefit of this is that we basically take no down time to flush 
the memstore (except for corner cases where there are too many HFiles on disk 
already, but we can ignore that as it part of the overall HBase design). 

We can leverage the same mechanism but instead just make the swap time-based.

When the RS gets the update to take the snapshot, it also has a timespan 
through which it should split writes between the memstores. For example, say we 
get a snapshot start notice at 10:15:00 and a prepare phase length of 80 
seconds (the max skew in the cluster +20 seconds for safety - just an example). 

For those 80 seconds, each HRegion will then time-flush the memstore. We take a 
regular memstore snapshot. Just like a regular flush, this ensures that all the 
outstanding writes to the memstore get committed (waiting on the read point to 
roll forward). However, instead of immediately writing to the new store, we 
split writes based on timestamp between the old and the new memstore. This 
management is handled by the Store, which just does some simple checking on the 
edits coming through to see which memstore it should direct the writes 
(admittedly, hand-waving away some of the complexity here).

Conceptually, this is like taking  snapshot, but instead of just having the 
snapshot be the immutable state (less the rollbacks made), we can just pass 
that KV set into a new MemStore that acts just like a regular memstore. Since 
all the high-level edits still go through the mvcc, we can keep track of the 
ordering in writes and the rollback mechanism on the original MemStore actually 
keeps its own state and the state of the snapshot-based MemStore in the correct 
state.

At this point, we can update the master (via ZK) that we have joined the 
snapshot. This is not strictly necessary, but is nice since we can then track 
progress of a snapshot. For instance, if a RS hasn't responded in within a 
certain window, we can immediately fail the snapshot and assume the RS has 
become inoperable. Since we are using the internal flushing mechanisms to 
remain mostly non-blocking, we can actually skip doing this update and just 
notify the master when we have done the write.

An alternative implementation here is to do what Jon has suggested and do a set 
a meta writes for the beginning and end of a snapshot. Then all you have to do 
is keep track of the WALs for the snapshot and replay those at the right time. 
However, that adds some complexity into how to restore a snapshot and may 
require rolling the WAL after the snapshot has been taken - a worrisome amount 
of complexity for something that should be entirely immutable. Since the flush 
can be done async and we don't block writes while waiting, it doesn't seem like 
a major issue to wait a little longer to complete a snapshot.

Back to the dual-pointer memstore snapshot implementation, once we pass the 80 
seconds, we then flush the old store to disk, add a reference to the new HFile, 
and then just direct all writes to the new store. Conceptually, this all seems 
to hang together, but the implementation is probably going to take a little 
more work.

There is a slight overhead to writes during the snapshot window. We will need 
to check the timestamp of every write going into the memstore, to figure out 
the store it needs to be written into. However, that is just a simple timestamp 
comparison and shouldn't be overly burdensome to the write throughput 
(especially if you can take a snapshot during a low-write period).

After this snapshot window, the state of the memstore will have been 
snapshotted and a flush will have been started. Now we can just flush this old 
memstore to disk as another HFile and add a reference to it for the snapshot. 
Its completely fine if this process takes a while because the server precedes 
happily, taking reads and writes like nothing is amiss because the semantics 
are the same as a regular flush. Once the file hits disk (and we have added 
references for each of the other files) we can consider the snapshot completed 
on that HRegion. Once that process completes for all involved HRegions on the 
HRegionServer we can consider the snapshot having completed the snapshot. 

Note that since the in memory state is all written to disk, we don't actually 
need to keep track of any of the HLogs. There is probably some re-jiggering 
here around failed Puts and the optimized write path there, but that is an 
implementation detail.

Once all the HRegionServers have taken the snapshot (passing up the 
notification by joining the barrier), the Master considers the snapshot 
completed and can move the snapshot from the .tmp to the .snapshot directory. 
The complete barrier is then just a barrier for the master, rather than for the 
region servers since there is not coordination necessary except to determine if 
a snapshot failed because a RS couldn't complete (which only the master needs 
to keep track of, to determine if a snapshot is valid or not).

There are some gotcha's with snapshotting with timestamps.

Suppose you are putting writes into the future. On a regular table doing a 
timestamp based Scan will still not find those futures writes; the same will be 
true of the snapshotted table - those writes will be directed to the new store 
and not found in the snapshot. 

The only weirdness that occurs with this form of snapshots is with future/past 
writes - essentially any time you start messing with the timestamps. Let's look 
an example. At 10:15:00 you take a snapshot of a table. However, on the same 
table, you make a Put  - 'row', 'cf', 10:20:00, 'value' - at 10:10:00, a put in 
the future but made _before_ you take a snapshot. The snapshot then precedes as 
expected. At some point later, you revive the snapshot and do a scan of the 
table with a timestamp of 10:15:00.; you won't find that earlier put ('row', 
'cf', 10:20:00, 'value'). However, if you just do a scan for the latest 
version, you *will* find that put!

It gets even odder if instead of making that future put before the snapshot was 
taken, but instead made it _while_ the snapshot was being taken. In this case, 
the revived snapshot will give you different semantics. The scan of the 
snapshot at 10:15:00 will still give you the same answer as before, but the 
latest version scan _will not find_ the future Put ('row', 'cf', 10:20:00, 
'value'). 

Unfortunately, these are the semantics of using timestamps over global 
consistency. I (and many others) feel that if you are messing with timestamps 
then its buyer beware. 

That said, there is way to get global consistency if you do mess with 
timestamps. If you have some centralized timestamp oracle, then this can give 
out strictly increasing timestamps with a lease for the timestamps.  (I've got 
a long flight next week where I'm hoping to pump out a basic implementation of 
this for hbase - no ticket, but just a little something on github). Since you 
know that the timestamps will expire after a given period, you just set the 
expiration time + fudge  as the timespan to split the memstore writes. After 
the expiration period you know that a timestamp is the oldest timestamp, so you 
can then comfortably flush the old memstore to disk, knowing that you have all 
the edits from that timestamp back in time. Note that you don't have the same 
problem as above since you only do scans in terms of the timestamps from the 
oracle, so future and past are really globally relative - there is no real puts 
too far into the future or past that are visible because all scans need to be 
as of a timestamp.
                
> [brainstorm] Timestamp based snapshots in HBase 0.96
> ----------------------------------------------------
>
>                 Key: HBASE-6180
>                 URL: https://issues.apache.org/jira/browse/HBASE-6180
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Jesse Yates
>             Fix For: 0.96.0
>
>
> Discussion ticket around doing timestamp based snapshots in HBase as an 
> extension/follow-on work for HBASE-6055. The implementation in HBASE-6055 (as 
> originally defined) is not sufficient for real-time clusters because it 
> requires downtime to take the snapshot. 
> Time-stamp based snapshots should not require downtime at the cost of 
> achieving global consistency. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to