[
https://issues.apache.org/jira/browse/HBASE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899119#action_12899119
]
ryan rawson commented on HBASE-2856:
------------------------------------
This issue is pernicious and difficult. I have attempted many times to nail
this down, but there continue to be issues.
Right now there are at least 2 issues:
- When a memstore is flushed to HFile, the 'memstoreTS' generation stamp is
dropped. With the way the scanners work right now, this causes a subtle
problem. A row is really composed of a sequence of KeyValues with different
memstoreTS values, so you can imagine that right before the flush the scanner
looks sort of like so:
                        scanner points here
  (new gen here)        |
  K K K K K K K K K K K K K K K K K K K K K    (keys for 1 row)
  (new and old gen)
The new gen keys have a newer memstoreTS value and were ignored during next().
So the scanner is pointing to a Key part of the way through a row. When the
memstore is pushed to an HFile, we mistakenly ignore all the skipped new gen
keys, and we end up with the FIRST KeyValue from the older gen and the
subsequent columns from the newer gen, ending up with a mixed-generation
result, thus looking like a non-atomic put.
One solution to this problem is to take the top Key during a scanner reset
(done after a switch between memstore -> hfile) and transform it into
KeyValue.createFirstOnRow(), as in the sketch below. There has to be exception
code for partial row fetches, where the scanner gets partial row results each
time (because the row is too big).
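To make the first idea concrete, here is a minimal sketch, assuming a
KeyValueScanner-style heap underneath the scanner. KeyValue.createFirstOnRow()
is the real helper; everything else here is illustrative, not the actual patch:

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.regionserver.KeyValueScanner;

// Hedged sketch, not the actual patch: after a memstore -> hfile switch,
// re-seek to the first key of the row the scanner was part-way through, so
// the whole row is re-read from one consistent view instead of mixing
// generations.
class RowBoundaryReseek {
  static void reseekAtRowStart(KeyValueScanner heap, KeyValue topBeforeSwitch)
      throws IOException {
    if (topBeforeSwitch == null) {
      return;  // nothing in flight, nothing to reposition
    }
    // Synthetic key that sorts before every real KeyValue of that row.
    KeyValue firstOnRow = KeyValue.createFirstOnRow(topBeforeSwitch.getRow());
    // Restart the scan at the row boundary. The partial-row ("row too big")
    // fetch case mentioned above would need an exception path that keeps the
    // current position instead.
    heap.seek(firstOnRow);
  }
}

The important part is that the seek lands on a row boundary rather than on
whatever arbitrary KeyValue the old heap top happened to be.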
- The second issue is that of scanner updates and multiple column families.
Right now the code does sort of this:
for each store (aka: family) in region:
  switch memstore -> hfile
For each Store/family the switch between memstore -> hfile is done atomically
with respect to scanners; that is, a scanner either accesses the old snapshot
and old files or the new (null) snapshot and the new files. But this atomic
snapshot is done at a Store-at-a-time level, meaning that concurrent scanners
might see some data in one store from the memstore (complete with generation
timestamps) and some data in another store all from hfile (with no
timestamps), thus again ending up with mixed-generation row data that violates
row atomicity.
One possible solution to this problem is to move scanner updates to the
HRegion/RegionScanner level, roughly as sketched below. This is not as easy as
it sounds from a code-structure point of view. Furthermore, by using larger
update blocks we may introduce performance issues around scanner updates.
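For illustration, a rough sketch of what a region-level update could look
like. Every name here (the FlushableStore interface, the lock, the spot where
scanners flip) is hypothetical; this is the shape of the change, not HRegion
code:

import java.io.IOException;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch only: a region-level (rather than per-Store) scanner update.
class RegionLevelFlushSketch {
  interface FlushableStore {
    void snapshotMemstoreAndFlush() throws IOException;  // per-family flush step
  }

  private final ReentrantReadWriteLock scannerLock = new ReentrantReadWriteLock();

  void flushAllStores(List<FlushableStore> stores) throws IOException {
    scannerLock.writeLock().lock();        // hold off scanner updates region-wide
    try {
      for (FlushableStore store : stores) {
        store.snapshotMemstoreAndFlush();  // flush each family
      }
      // Scanners would flip to the post-flush view here, once, after *all*
      // families have switched, so a row never mixes memstore data from one
      // family with hfile data from another.
    } finally {
      scannerLock.writeLock().unlock();
    }
  }
}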
-----------
The ultimate problem here is that these two are just the most recently
identified bugs in a long list of bugs. We can fix them in the ways described
above, but what about the next series of bugs? We are plugging leaks in a dam.
A better approach would be to build a new, better dam, armed with the
knowledge we have about what previously went wrong.
------------
The proposal is as follows:
- Add a new generation timestamp (or perhaps we could call it transaction id)
to every KeyValue. This generation timestamp is also serialized and stored
inside HFile.
- The generation timestamp starts at 1 for brand-new tables. Every write
operation increments it by 1, much like ReadWriteConsistencyControl does.
- KeyValues with ts=0 belong to no generation and are always included in every
read operation
- KeyValues with ts>0 belong to that specific generation, and the reader should
include them only if its own read point is larger (see the sketch after this
list). The reader obtains the read point at scanner start time, and may
possibly update it between rows if the implementation can do so (as per the
current RWCC algorithms)
- Bulk-imported HFiles have all their TS=0 (the default value)
- Modifications set their ts=X as per the current RWCC algorithms.
- During HFile flush, the largest TS is written to the HFile metadata.
- During HRegion/Store init, the largest TS is obtained from HFile metadata,
and the RWCC is set to this value.
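A tiny sketch of the read rule in the bullets above; this is just the
visibility predicate, not the RWCC source:

// Hedged sketch of the visibility rule described above:
//   ts == 0 -> belongs to no generation, always visible
//   ts  > 0 -> visible only to readers whose read point is larger
class GenerationFilterSketch {
  static boolean isVisible(long kvGenerationTs, long readerReadPoint) {
    if (kvGenerationTs == 0) {
      return true;  // bulk-loaded or pre-upgrade data, always included
    }
    return readerReadPoint > kvGenerationTs;  // committed before the reader's point
  }

  public static void main(String[] args) {
    long readPoint = 7;                           // obtained at scanner start
    System.out.println(isVisible(0, readPoint));  // true  (no generation)
    System.out.println(isVisible(5, readPoint));  // true  (older write)
    System.out.println(isVisible(9, readPoint));  // false (newer/uncommitted write)
  }
}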
Some predicted questions:
- What about the extra space? 8 bytes is a lot.
-- With vint encoding, and given that the values start at 1, we should be able
to shave off a lot of space. We will also have to store a length field, costing
1 extra byte: 2 bytes minimum for '0', as illustrated below.
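A quick illustration of that size math, assuming Hadoop's vint encoding is
what we end up using (the actual on-disk layout is still only proposed):

import org.apache.hadoop.io.WritableUtils;

// Hedged illustration of the space estimate above: Hadoop's vint encoding
// stores small values in a single byte, so memstoreTS plus a 1-byte length
// prefix costs about 2 bytes for most KeyValues.
class VIntSizeSketch {
  public static void main(String[] args) {
    long[] samples = {0L, 1L, 127L, 1000000L, Long.MAX_VALUE};
    for (long ts : samples) {
      int vintBytes = WritableUtils.getVIntSize(ts);  // encoded vint size
      System.out.println("ts=" + ts + " -> " + (1 + vintBytes)
          + " bytes including a 1-byte length field");
    }
  }
}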
- What about vint decode speed?
-- It may not be an issue; we need to profile. If it is, we can trade off RAM
space in KeyValue to cache the decoded value.
- What about file migration? I don't want to take a long time to upgrade my
cluster!
-- By creating a new metadata key, call it KEYVALUE_VERSION, the Store layer on
top of HFile can parse the older KeyValues from HFile (which won't have this
metadata), substituting '0' for the missing timestamp; see the sketch below.
-- Newer HFiles will use the newer KeyValue parse code to obtain the generation
timestamp.
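A sketch of what that check could look like at file-open time. KEYVALUE_VERSION
is only the key proposed above (it does not exist yet), and the map stands in
for whatever metadata HFile hands back:

import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.hbase.util.Bytes;

// Hedged sketch of the proposed migration check, not existing code.
class KeyValueVersionSketch {
  static final byte[] KEYVALUE_VERSION = Bytes.toBytes("KEYVALUE_VERSION");

  static boolean hasGenerationTimestamps(Map<byte[], byte[]> fileInfo) {
    // Older HFiles have no such key: parse them with the old KeyValue layout
    // and substitute 0 for the missing generation timestamp.
    return fileInfo.containsKey(KEYVALUE_VERSION);
  }

  public static void main(String[] args) {
    Map<byte[], byte[]> oldFile = new TreeMap<byte[], byte[]>(Bytes.BYTES_COMPARATOR);
    Map<byte[], byte[]> newFile = new TreeMap<byte[], byte[]>(Bytes.BYTES_COMPARATOR);
    newFile.put(KEYVALUE_VERSION, Bytes.toBytes(1));
    System.out.println(hasGenerationTimestamps(oldFile));  // false -> old parse path
    System.out.println(hasGenerationTimestamps(newFile));  // true  -> new parse path
  }
}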
- This is a lot of cost just to fix something it seems we should be able to
track in memory.
-- Yes, it isn't cheap, but this problem has proven difficult to solve. By
addressing the issue directly we can have confidence in our design, and the
existing test frameworks which are currently failing should validate it for us.
- What about splits?
-- Each split region will 'take off' from the maximal generation timestamp and
evolve independently. If they were ever merged back together, that would be
OK: each region would have to close, halting all writes, and the merged region
would then take the largest generation timestamp and continue evolving from
there (see the sketch below). The timestamp exists to differentiate committed
and non-committed data, and during a close, by definition, all values in the
HFile are committed.
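A small sketch of that seeding rule at region open / merge time; the list of
per-file maxima and the returned counter are stand-ins, not the RWCC API:

import java.util.Arrays;
import java.util.List;

// Hedged sketch of the rule above: seed the write counter from the largest
// generation timestamp recorded in any HFile's metadata, so new writes in the
// (re)opened or merged region continue from there.
class GenerationSeedSketch {
  static long seedFromHFiles(List<Long> perFileMaxGenerationTs) {
    long max = 0;  // 0 == "no generation" default
    for (long ts : perFileMaxGenerationTs) {
      max = Math.max(max, ts);
    }
    return max;    // the next write would use max + 1
  }

  public static void main(String[] args) {
    // e.g. two regions being merged, each with its own HFile metadata maximum
    System.out.println(seedFromHFiles(Arrays.asList(42L, 37L)));  // 42
  }
}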
- OK, so I'm paying at least 2 bytes per KeyValue; can we get a better return
on investment?
-- Yes! By using the generation timestamp we can detect the situation where a
previously inserted Delete is masking newer Puts. We can switch between the
existing behaviour and the newer behaviour at runtime and ignore the rogue
Delete marker (see the sketch after this question).
-- There may be other benefits as well. Providing a partial total order over
all the KeyValue inserts, grouped by transaction, would provide useful
debugging info for both HBase and client bugs.
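A sketch of the rogue-Delete check from the first answer above; the names and
the runtime switch are illustrative only:

// Hedged sketch: with generation timestamps persisted, a Delete written in an
// older generation than a Put can be recognised and optionally ignored.
class DeleteMaskSketch {
  static boolean deleteMasksPut(long deleteGenerationTs, long putGenerationTs,
                                boolean newBehaviour) {
    if (!newBehaviour) {
      return true;  // existing behaviour: the earlier Delete keeps masking the Put
    }
    // New behaviour: a Delete only masks Puts from its own or an earlier
    // generation; a Put inserted in a later generation stays visible.
    return deleteGenerationTs >= putGenerationTs;
  }
}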
- What about read isolation?
-- With this we can definitely implement at-scanner-open read isolation, as in
the sketch below. Due to the nature of the scanner design, advancing the
generation timestamp read point mid-scan is more difficult, but it may be
possible.
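A last sketch of what at-scanner-open isolation amounts to, assuming a simple
shared write counter (a stand-in, not the RWCC class):

import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch: capture the read point once when the scanner is created and
// filter every KeyValue against it with the same visibility rule as earlier.
class SnapshotScannerSketch {
  private final long readPoint;

  SnapshotScannerSketch(AtomicLong writeCounter) {
    this.readPoint = writeCounter.get();  // fixed for the scanner's lifetime
  }

  boolean includes(long kvGenerationTs) {
    return kvGenerationTs == 0 || readPoint > kvGenerationTs;
  }
}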
> TestAcidGuarantee broken on trunk
> ----------------------------------
>
> Key: HBASE-2856
> URL: https://issues.apache.org/jira/browse/HBASE-2856
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.89.20100621
> Reporter: ryan rawson
> Assignee: ryan rawson
> Priority: Blocker
> Fix For: 0.90.0
>
>
> TestAcidGuarantee has a test whereby it attempts to read a number of columns
> from a row, and every so often the first column of N is different, when it
> should be the same. This is a bug deep inside the scanner whereby the first
> peek() of a row is done at time T then the rest of the read is done at T+1
> after a flush, thus the memstoreTS data is lost, and previously 'uncommitted'
> data becomes committed and flushed to disk.
> One possible solution is to introduce the memstoreTS (or similarly equivalent
> value) to the HFile thus allowing us to preserve read consistency past
> flushes. Another solution involves fixing the scanners so that peek() is not
> destructive (and thus might return different things at different times alas).