[
https://issues.apache.org/jira/browse/HBASE-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290554#comment-13290554
]
Jesse Yates commented on HBASE-6180:
------------------------------------
Here is what I've been thinking about for doing timestamp based snapshotting,
as an extension to the work I've been doing for HBASE-6055.
Timestamp based snapshots are a zero-downtime/non-blocking version of taking a
snapshot across a table in HBase. They should be considered 'fuzzy' because you
don't get a global view, but only as close to globally consistent as we can get
with timestamps on the region servers (fuzziness is in the NTP difference
between RS, which defaults to max skew of 60 sec). I'm going to mingle a bit of
theory and implementation here, but feel free to ask questions for things that
don't make sense.
All the infrastructure from point-in-time snapshots (HBASE-50) is still going
to be used here: SnapshotManager on the Master, the RegionSnapshotHandler on
the RS, etc. The only change is what actually happens on each of the regions
when taking the snapshot and how the snapshot is managed on the Master. Also,
on a lower level, the time constraints are much looser on taking the snapshot.
Let's walk through some of the changes to the actual implementation.
From a high level, we still tell all the RS to start the snapshot. They will
then dump a meta edit into the WAL with the memstore timestamp (not clear if
this is necessary, but could be useful for completing snapshots on failed RS).
They will then post back to ZK that they are starting the snapshot. Each RS
can then go about its business adding references to all the files on the FS
for the Regions involved in the snapshot. It gets a little tricky when we try
to capture the in-memory state of each RS.
The key here though is that we can use the Memstore's built-in snapshot
functionality to avoid doing any work with the WALs and just keep track of
HFiles. When flushing, the Memstore takes a "snapshot" by just blocking for a
moment to switch two pointers between the current and the new memstore. All
writes before the switch go into the old memstore. All new writes go into the
new memstore. The old memstore can then asynchronously be flushed to disk and
on scan we just merge in the results from the old version as well as the on
disk files. The benefit of this is that we basically take no down time to flush
the memstore (except for corner cases where there are too many HFiles on disk
already, but we can ignore that as it is part of the overall HBase design).
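As a rough illustration of the pointer swap described above, here is a minimal
Java sketch. SwappingMemstore and all of its methods are hypothetical names made
up for this example, not the actual HBase MemStore API, and a plain TreeMap
stands in for the real KV set.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a memstore; not the real HBase MemStore class. */
final class SwappingMemstore {
    private TreeMap<String, String> active = new TreeMap<>();   // receives new writes
    private TreeMap<String, String> snapshot = new TreeMap<>(); // frozen, awaiting flush

    synchronized void put(String key, String value) {
        active.put(key, value);
    }

    /** Writers are blocked only for the duration of the pointer swap. */
    synchronized void snapshot() {
        if (!snapshot.isEmpty()) {
            throw new IllegalStateException("previous snapshot not flushed yet");
        }
        snapshot = active;
        active = new TreeMap<>();
    }

    /** Reads merge the active set with the not-yet-flushed snapshot. */
    synchronized String get(String key) {
        String value = active.get(key);
        return value != null ? value : snapshot.get(key);
    }

    /** Read-only view for an (assumed) asynchronous flusher to write out. */
    synchronized Map<String, String> snapshotForFlush() {
        return Collections.unmodifiableMap(snapshot);
    }

    /** Called once the flushed file is on disk; reads would then hit the file. */
    synchronized void clearSnapshot() {
        snapshot = new TreeMap<>();
    }
}
```

The sketch uses coarse synchronization for brevity; the point is only that the
swap itself is cheap and that the old set stays readable while it is flushed.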
We can leverage the same mechanism but instead just make the swap time-based.
When the RS gets the update to take the snapshot, it also has a timespan
through which it should split writes between the memstores. For example, say we
get a snapshot start notice at 10:15:00 and a prepare phase length of 80
seconds (the max skew in the cluster +20 seconds for safety - just an example).
For those 80 seconds, each HRegion will then time-flush the memstore. We take a
regular memstore snapshot. Just like a regular flush, this ensures that all the
outstanding writes to the memstore get committed (waiting on the read point to
roll forward). However, instead of immediately writing to the new store, we
split writes based on timestamp between the old and the new memstore. This
management is handled by the Store, which just does some simple checking on the
edits coming through to see which memstore it should direct the writes
(admittedly, hand-waving away some of the complexity here).
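To make the routing check concrete, here is a minimal Java sketch of the
time-based split, using the 60 second skew plus 20 second safety margin from
the example above. TimeSplitStore is a hypothetical class, not the HBase Store
API. It assumes that an edit stamped at or before the snapshot start belongs to
the old memstore; the exact boundary condition is my assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of timestamp-based write routing; not HBase's Store. */
final class TimeSplitStore {
    private final List<Long> oldMemstore = new ArrayList<>(); // edits captured by the snapshot
    private final List<Long> newMemstore = new ArrayList<>(); // edits after the snapshot
    private final long snapshotStartMs;
    private final long windowEndMs;

    TimeSplitStore(long snapshotStartMs, long maxSkewMs, long safetyMs) {
        this.snapshotStartMs = snapshotStartMs;
        // Prepare-phase length = max cluster skew + safety margin.
        this.windowEndMs = snapshotStartMs + maxSkewMs + safetyMs;
    }

    long windowEndMs() {
        return windowEndMs;
    }

    /**
     * Routes one edit. Returns true if it went to the old memstore.
     * editTs is the timestamp on the edit, nowMs the local clock of the RS.
     */
    boolean add(long editTs, long nowMs) {
        boolean windowOpen = nowMs < windowEndMs;
        if (windowOpen && editTs <= snapshotStartMs) {
            oldMemstore.add(editTs);
            return true;
        }
        // Window closed, or the edit is stamped after the snapshot start.
        newMemstore.add(editTs);
        return false;
    }

    int oldSize() { return oldMemstore.size(); }
    int newSize() { return newMemstore.size(); }
}
```

With a start of 1,000 ms, a skew of 60,000 ms and a safety margin of 20,000 ms,
the window closes at 81,000 ms; after that every write goes to the new store
regardless of its timestamp.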
Conceptually, this is like taking a snapshot, but instead of just having the
snapshot be the immutable state (less the rollbacks made), we can just pass
that KV set into a new MemStore that acts just like a regular memstore. Since
all the high-level edits still go through the mvcc, we can keep track of the
ordering in writes and the rollback mechanism on the original MemStore actually
keeps its own state and the state of the snapshot-based MemStore in the correct
state.
At this point, we can update the master (via ZK) that we have joined the
snapshot. This is not strictly necessary, but is nice since we can then track
progress of a snapshot. For instance, if a RS hasn't responded within a
certain window, we can immediately fail the snapshot and assume the RS has
become inoperable. Since we are using the internal flushing mechanisms to
remain mostly non-blocking, we can actually skip doing this update and just
notify the master when we have done the write.
An alternative implementation here is to do what Jon has suggested and do a set
of meta writes for the beginning and end of a snapshot. Then all you have to do
is keep track of the WALs for the snapshot and replay those at the right time.
However, that adds some complexity into how to restore a snapshot and may
require rolling the WAL after the snapshot has been taken - a worrisome amount
of complexity for something that should be entirely immutable. Since the flush
can be done async and we don't block writes while waiting, it doesn't seem like
a major issue to wait a little longer to complete a snapshot.
Back to the dual-pointer memstore snapshot implementation, once we pass the 80
seconds, we then flush the old store to disk, add a reference to the new HFile,
and then just direct all writes to the new store. Conceptually, this all seems
to hang together, but the implementation is probably going to take a little
more work.
There is a slight overhead to writes during the snapshot window. We will need
to check the timestamp of every write going into the memstore, to figure out
the store it needs to be written into. However, that is just a simple timestamp
comparison and shouldn't be overly burdensome to the write throughput
(especially if you can take a snapshot during a low-write period).
After this snapshot window, the state of the memstore will have been
snapshotted and a flush will have been started. Now we can just flush this old
memstore to disk as another HFile and add a reference to it for the snapshot.
It's completely fine if this process takes a while because the server proceeds
happily, taking reads and writes like nothing is amiss because the semantics
are the same as a regular flush. Once the file hits disk (and we have added
references for each of the other files) we can consider the snapshot completed
on that HRegion. Once that process completes for all involved HRegions on the
HRegionServer we can consider the snapshot completed on that HRegionServer.
Note that since the in memory state is all written to disk, we don't actually
need to keep track of any of the HLogs. There is probably some re-jiggering
here around failed Puts and the optimized write path there, but that is an
implementation detail.
Once all the HRegionServers have taken the snapshot (passing up the
notification by joining the barrier), the Master considers the snapshot
completed and can move the snapshot from the .tmp to the .snapshot directory.
The complete barrier is then just a barrier for the master, rather than for the
region servers since there is no coordination necessary except to determine if
a snapshot failed because a RS couldn't complete (which only the master needs
to keep track of, to determine if a snapshot is valid or not).
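The master-side bookkeeping described above could be sketched roughly as
follows. MasterSnapshotBarrier is a hypothetical class for illustration only
(it is not the SnapshotManager from the patch), and it leaves out ZK, timeouts
and the actual move from .tmp to .snapshot.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical master-side tracker; not HBase's SnapshotManager. */
final class MasterSnapshotBarrier {
    enum State { IN_PROGRESS, COMPLETED, FAILED }

    private final Set<String> pending; // region servers that have not reported yet
    private boolean failed;

    MasterSnapshotBarrier(Set<String> regionServers) {
        this.pending = new HashSet<>(regionServers);
    }

    /** A region server joined the barrier: all its regions are flushed and referenced. */
    void serverCompleted(String regionServer) {
        pending.remove(regionServer);
    }

    /** A region server could not complete, so the whole snapshot is invalid. */
    void serverFailed(String regionServer) {
        failed = true;
    }

    /** Only the master evaluates this; region servers never wait on each other. */
    State state() {
        if (failed) {
            return State.FAILED;
        }
        return pending.isEmpty() ? State.COMPLETED : State.IN_PROGRESS;
    }
}
```

Once state() reports COMPLETED the master would promote the snapshot; on FAILED
it would discard the temporary directory.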
There are some gotchas with snapshotting with timestamps.
Suppose you are putting writes into the future. On a regular table doing a
timestamp based Scan will still not find those future writes; the same will be
true of the snapshotted table - those writes will be directed to the new store
and not found in the snapshot.
The only weirdness that occurs with this form of snapshots is with future/past
writes - essentially any time you start messing with the timestamps. Let's look
at an example. At 10:15:00 you take a snapshot of a table. However, on the same
table, you make a Put - 'row', 'cf', 10:20:00, 'value' - at 10:10:00, a put in
the future but made _before_ you take a snapshot. The snapshot then proceeds as
expected. At some point later, you revive the snapshot and do a scan of the
table with a timestamp of 10:15:00; you won't find that earlier put ('row',
'cf', 10:20:00, 'value'). However, if you just do a scan for the latest
version, you *will* find that put!
It gets even odder if, instead of making that future put before the snapshot
was taken, you made it _while_ the snapshot was being taken. In this case,
the revived snapshot will give you different semantics. The scan of the
snapshot at 10:15:00 will still give you the same answer as before, but the
latest version scan _will not find_ the future Put ('row', 'cf', 10:20:00,
'value').
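The two cases can be walked through with a small Java simulation. FuturePutDemo
is purely illustrative: times are seconds since midnight, every write is
assumed to arrive before the snapshot window closes, and the rule "stamped at
or before the snapshot time goes to the old store" is my reading of the split,
not a confirmed implementation detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative simulation of future puts around a snapshot; not HBase code. */
final class FuturePutDemo {
    static final long SNAPSHOT_AT = 36_900L; // 10:15:00 as seconds since midnight

    /** Each put is {cell timestamp, wall-clock time at which the write arrived}. */
    static List<Long> capturedBySnapshot(long[][] puts) {
        List<Long> captured = new ArrayList<>();
        for (long[] put : puts) {
            long cellTs = put[0];
            long arrivedAt = put[1];
            // Before the snapshot there is only one memstore, so everything in it
            // is captured. During the window only edits stamped at or before the
            // snapshot time are routed to the old (snapshotted) store.
            if (arrivedAt < SNAPSHOT_AT || cellTs <= SNAPSHOT_AT) {
                captured.add(cellTs);
            }
        }
        return captured;
    }

    /** Timestamp-bounded scan: only cells stamped at or before ts are visible. */
    static List<Long> scanAsOf(List<Long> cells, long ts) {
        List<Long> visible = new ArrayList<>();
        for (long cellTs : cells) {
            if (cellTs <= ts) {
                visible.add(cellTs);
            }
        }
        return visible;
    }
}
```

A put stamped 10:20:00 (37,200) that arrives at 10:10:00 (36,600) ends up in
the snapshot but is hidden from a scan as of 10:15:00; the same put arriving at
10:15:30 (36,930) never reaches the snapshot at all.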
Unfortunately, these are the semantics of using timestamps over global
consistency. I (and many others) feel that if you are messing with timestamps
then it's buyer beware.
That said, there is a way to get global consistency if you do mess with
timestamps. If you have some centralized timestamp oracle, then this can give
out strictly increasing timestamps with a lease for the timestamps. (I've got
a long flight next week where I'm hoping to pump out a basic implementation of
this for hbase - no ticket, but just a little something on github). Since you
know that the timestamps will expire after a given period, you just set the
expiration time + fudge as the timespan to split the memstore writes. After
the expiration period you know that a timestamp is the oldest timestamp, so you
can then comfortably flush the old memstore to disk, knowing that you have all
the edits from that timestamp back in time. Note that you don't have the same
problem as above since you only do scans in terms of the timestamps from the
oracle, so future and past are really globally relative - there are no real puts
too far into the future or past that are visible because all scans need to be
as of a timestamp.
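For what it's worth, the oracle idea boils down to something like the sketch
below. LeasedTimestampOracle is a hypothetical, single-process toy (an
in-memory counter, no networking or persistence); the only parts taken from the
description above are strictly increasing timestamps, a lease per timestamp,
and a split window of lease length plus some fudge.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy, in-process timestamp oracle; a sketch of the idea, not a real service. */
final class LeasedTimestampOracle {
    private final AtomicLong last = new AtomicLong();
    private final long leaseMs;

    LeasedTimestampOracle(long leaseMs) {
        this.leaseMs = leaseMs;
    }

    /** Hands out strictly increasing timestamps, even under concurrent callers. */
    long next() {
        return last.incrementAndGet();
    }

    /** A timestamp issued at issuedAtMs may be used for writes until this time. */
    long leaseExpiresAt(long issuedAtMs) {
        return issuedAtMs + leaseMs;
    }

    /** Timespan over which a region should split writes between memstores. */
    long splitWindowMs(long fudgeMs) {
        return leaseMs + fudgeMs;
    }
}
```

With a 30 second lease and a 5 second fudge factor, a region would split writes
for 35 seconds before flushing the old memstore.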
> [brainstorm] Timestamp based snapshots in HBase 0.96
> ----------------------------------------------------
>
> Key: HBASE-6180
> URL: https://issues.apache.org/jira/browse/HBASE-6180
> Project: HBase
> Issue Type: Brainstorming
> Reporter: Jesse Yates
> Fix For: 0.96.0
>
>
> Discussion ticket around doing timestamp based snapshots in HBase as an
> extension/follow-on work for HBASE-6055. The implementation in HBASE-6055 (as
> originally defined) is not sufficient for real-time clusters because it
> requires downtime to take the snapshot.
> Time-stamp based snapshots should not require downtime at the cost of
> achieving global consistency.