[ https://issues.apache.org/jira/browse/HBASE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899119#action_12899119 ]

ryan rawson commented on HBASE-2856:
------------------------------------

This issue is pernicious and difficult.  I have attempted many times to nail 
it down, but issues continue to surface.

Right now there are at least two issues:
- When a memstore is flushed to an HFile, the 'memstoreTS' generation stamp is 
dropped.  Given the way the scanners currently work, this causes a subtle 
problem.  A row is really composed of a sequence of KeyValues with different 
memstoreTS values, so you can imagine that right before the flush the scanner 
looks roughly like this:

                       scanner points here
(new gen here)                  |
K K K K K K K K K K K K K K K K K K K K K K K   (keys for 1 row)
                                (new and old gen)

The new-gen keys have newer memstoreTS values and were skipped during next(), 
so the scanner is pointing to a Key part of the way through the row.  When the 
memstore is flushed to an HFile the memstoreTS values are dropped, so the 
previously skipped new-gen keys are no longer filtered out.  We end up with 
the FIRST KeyValue from the older gen and the subsequent columns from the 
newer gen, a mixed-generation result that looks like a non-atomic put.

One solution to this problem is to take the top Key during a scanner reset 
(done after a memstore -> hfile switch) and transform it into 
KeyValue.createFirstOnRow(), so the whole row is re-read from its first 
column.  There has to be exception code for partial-row fetches, where the 
scanner returns partial row results each time because the row is too big.
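
Roughly what that reset might look like, as a sketch against the KeyValue API 
of that era (KeyValue.createFirstOnRow and KeyValue.getRow are real; the 
ScannerResetSketch wrapper and the surrounding StoreScanner plumbing are made 
up for illustration):

// Sketch only: after a memstore -> hfile switch, reseek to the first key
// of the row the scanner was parked on, so the whole row is re-read from
// a single generation.
import org.apache.hadoop.hbase.KeyValue;

public class ScannerResetSketch {
  /**
   * Given the key the scanner was parked on when the readers changed,
   * return a "fake" key that sorts before every real KeyValue of that
   * row; seeking to it restarts the row at its first column.
   */
  public static KeyValue resetKeyAfterFlush(KeyValue topKey) {
    return KeyValue.createFirstOnRow(topKey.getRow());
  }
}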

- The second issue is that of scanner updates and multiple column families.  
Right now the code does roughly this:
for each store (aka: family) in region:
   switch memstore -> hfile

For each Store/family the memstore -> hfile switch is done atomically with 
respect to scanners; that is, a scanner accesses either the old snapshot and 
old files or the new (null) snapshot and the new files.  But this atomic 
switch is done one Store at a time, meaning that a concurrent scanner might 
see data in one store from the memstore (complete with generation timestamps) 
and data in another store entirely from hfiles (with no timestamps), thus 
again ending up with mixed-generation row data that violates row atomicity.

One possible solution to this problem is to move scanner updates up to the 
HRegion/RegionScanner level so that all families switch at once; a rough 
sketch follows.  This is not as easy as it sounds from a code-structure point 
of view.  Furthermore, by using larger update blocks we may introduce 
performance issues around scanner updates.
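
A rough sketch of the region-level idea, with hypothetical types 
(FamilyScanner and RegionLevelScannerUpdate are illustrative names, not the 
real HRegion/StoreScanner classes), just to show the locking shape:

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// One scanner per column family; updateReaders() re-opens it on that
// family's new snapshot/files after a flush.
interface FamilyScanner {
  void updateReaders();
}

class RegionLevelScannerUpdate {
  private final List<FamilyScanner> familyScanners;
  private final ReentrantReadWriteLock flushLock = new ReentrantReadWriteLock();

  RegionLevelScannerUpdate(List<FamilyScanner> familyScanners) {
    this.familyScanners = familyScanners;
  }

  /** Called once per flush, only after every Store has written its new HFile. */
  void updateReaders() {
    flushLock.writeLock().lock();
    try {
      for (FamilyScanner s : familyScanners) {
        s.updateReaders();
      }
    } finally {
      flushLock.writeLock().unlock();
    }
  }

  /** Row reads hold the read lock, so a single row never straddles a flush. */
  <T> T readRow(Callable<T> rowRead) throws Exception {
    flushLock.readLock().lock();
    try {
      return rowRead.call();
    } finally {
      flushLock.readLock().unlock();
    }
  }
}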


-----------

The ultimate problem here is that these two problems are just the most 
recently identified bugs in a long list of bugs.  We can fix them in the ways 
described above, but what about the next series of bugs?  We are plugging 
leaks in a dam.  A better approach would be to build a new, better dam, armed 
with the knowledge of what previously went wrong.

------------

The proposal is as follows:
- Add a new generation timestamp (or perhaps we could call it a transaction 
id) to every KeyValue.  This generation timestamp is also serialized and 
stored inside the HFile (a serialization sketch follows this list).
- The generation timestamp starts at 1 for brand-new tables.  Every write 
operation increments it by 1, much like ReadWriteConsistencyControl.
- KeyValues with ts=0 belong to no generation and are always included in every 
read operation
- KeyValues with ts>0 belong to that specific generation and the reader should 
include them only if their own read point is larger. The reader obtains the 
read point at scanner start time, and may possibly update it between rows if 
the implementation can do so. (As per the current RWCC algorithms)
- Bulk imported hfiles have all their TS=0 (the default value)
- Modifications set their ts=X as per the current RWCC algorithms.
- During hfile flush the largest TS is written to the HFile metadata.
- During HRegion/Store init, the largest TS is obtained from HFile metadata, 
and the RWCC is set to this value.
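
A minimal sketch of the vint serialization the proposal implies, assuming 
Hadoop's WritableUtils encoding; GenerationStampSketch and the 'newFormat' 
flag are illustrative, and the exact placement of the stamp within the 
KeyValue/HFile layout is left open here:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

public class GenerationStampSketch {
  /** Append the generation timestamp after the KeyValue's existing bytes. */
  public static void writeGenerationStamp(DataOutput out, long genTs)
      throws IOException {
    // Small values (a young table) cost a single byte.
    WritableUtils.writeVLong(out, genTs);
  }

  /**
   * Files written before the new KEYVALUE_VERSION metadata key carry no
   * stamp; treat their KeyValues as ts=0, i.e. "no generation, always
   * visible to every reader".
   */
  public static long readGenerationStamp(DataInput in, boolean newFormat)
      throws IOException {
    return newFormat ? WritableUtils.readVLong(in) : 0L;
  }
}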

Some predicted questions:
- What about the extra space? 8 bytes is a lot.
-- With vint encoding, and given that the values start at 1, we should be able 
to shave a lot of space off.  We will also have to store a length field 
costing 1 extra byte, so 2 bytes minimum for '0'.  (A quick size check is 
sketched after this list.)
- What about vint decode speed?
-- It may not be an issue; we need to profile.  If it is, we can trade off RAM 
space in KeyValue to cache the decoded value.
- What about file migration? I don't want to take a long time to upgrade my 
cluster!
-- By creating a new metadata key, call it KEYVALUE_VERSION, the Store layer 
on top of HFile can parse the older KeyValues from HFiles (which won't have 
this metadata), substituting '0' for the missing timestamp.
-- Newer HFiles will use the newer KeyValue parse code to obtain the generation 
timestamp.
- This is a lot of cost just to fix something it seems we should be able to 
track in memory.
-- Yes, it isn't cheap, but this problem has proven difficult to solve.  By 
addressing the issue directly we can have confidence in our design, and the 
existing test frameworks which are currently failing should validate it for us.
- What about splits?
-- Each split region will 'take off' from the maximal generation timestamp 
and evolve independently.  If they were later merged back together that would 
be fine: each region would have to close, halting all writes, and the merged 
region would then take the largest generation timestamp and continue evolving 
from there.  The timestamp exists to differentiate committed and uncommitted 
data, and during a close, by definition, all values in the HFile are committed.
- OK, so I'm paying at least 2 bytes per KeyValue; can we get a better return 
on investment?
-- Yes!  By using the generation timestamp we can detect the situation where a 
previously inserted Delete is masking new Puts.  We can switch between the 
existing behaviour and newer behaviour at runtime and ignore the rogue Delete 
marker.
-- There may be other benefits as well.  A partial ordering over all the 
KeyValue inserts, grouped by transaction, would provide useful debugging info 
for both HBase and client bugs.
- What about read isolation?
-- With this we can definitely implement at-scanner-open read isolation.  Due 
to the nature of the scanner design, advancing the generation timestamp read 
point mid-scan is more difficult, but it may be possible.
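
As a quick check on the space question above, a tiny sketch using Hadoop's 
WritableUtils vint encoding (VIntSizeCheck is just an illustrative harness; it 
shows only the vint cost, not the extra length byte mentioned in the answer):

import org.apache.hadoop.io.WritableUtils;

public class VIntSizeCheck {
  public static void main(String[] args) {
    long[] samples = {0L, 1L, 127L, 128L, 1000000L, Long.MAX_VALUE};
    for (long ts : samples) {
      // getVIntSize reports how many bytes the vint encoding of ts uses;
      // small generation timestamps encode in a single byte.
      System.out.println("ts=" + ts + " -> "
          + WritableUtils.getVIntSize(ts) + " byte(s)");
    }
  }
}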




> TestAcidGuarantee broken on trunk 
> ----------------------------------
>
>                 Key: HBASE-2856
>                 URL: https://issues.apache.org/jira/browse/HBASE-2856
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.89.20100621
>            Reporter: ryan rawson
>            Assignee: ryan rawson
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> TestAcidGuarantee has a test whereby it attempts to read a number of columns 
> from a row, and every so often the first column of N is different, when it 
> should be the same.  This is a bug deep inside the scanner whereby the first 
> peek() of a row is done at time T then the rest of the read is done at T+1 
> after a flush, thus the memstoreTS data is lost, and previously 'uncommitted' 
> data becomes committed and flushed to disk.
> One possible solution is to introduce the memstoreTS (or similarly equivalent 
> value) to the HFile thus allowing us to preserve read consistency past 
> flushes.  Another solution involves fixing the scanners so that peek() is not 
> destructive (and thus might return different things at different times alas).
