[
https://issues.apache.org/jira/browse/HBASE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981048#action_12981048
]
stack commented on HBASE-2856:
------------------------------
Just had a good conversation with Ryan. We concluded that using the HLog sequence
number is NOT a good idea, mostly for performance reasons. Too many updates
will be stuck waiting on the completion of edits that may have started before
our update but that have yet to complete (we do not want to return to the
client until all transactions started before ours -- but that are slower than
ours to run -- have completed, else there is the danger of not being able to
see what you have written). Instead, we need to keep a running sequence number
that is per HRegion rather than per HRegionServer, as the HLog sequence number
is. This new HRegion sequence number is very much like the HLog sequence number
in that on open of an HRegion we read in the largest and then increment from there.
Let me try and explain how we arrived at this notion.
We do ACID -- prevent readers reading part of an update -- by only letting
clients (scanners and gets) read stuff that has been fully committed.
Currently we do this by moving forward a monotonically increasing 'read
point'. Each update is given a write point. The read point is moved forward
to encompass all completed write points or 'transactions'. Transactions
complete willy-nilly but the read point will not move beyond the incomplete.
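The read point behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual RWCC code: the class name `ReadPointTracker` and its methods are made up for the example, and a real implementation would also have to deal with failed or abandoned writes.

```java
import java.util.TreeSet;

/**
 * Hypothetical sketch (not HBase's RWCC): the read point only advances
 * across a contiguous run of completed write points.
 */
class ReadPointTracker {
    private long nextWritePoint;
    private long readPoint;
    // Write points that have completed but are still ahead of the read point.
    private final TreeSet<Long> completed = new TreeSet<>();

    ReadPointTracker(long start) {
        this.nextWritePoint = start;
        this.readPoint = start;
    }

    /** Hand out the next write point ('transaction' id). */
    synchronized long beginWrite() {
        return ++nextWritePoint;
    }

    /** Mark a write point complete and advance the read point as far as allowed. */
    synchronized void completeWrite(long writePoint) {
        completed.add(writePoint);
        // Stop at the first gap: an incomplete earlier transaction blocks progress.
        while (!completed.isEmpty() && completed.first() == readPoint + 1) {
            readPoint = completed.pollFirst();
        }
    }

    /** Readers (scanners and gets) may only see edits at or below this point. */
    synchronized long readPoint() {
        return readPoint;
    }
}
```

If write points 1 and 2 are handed out and 2 completes first, the read point stays where it was; once 1 completes, it jumps forward to 2 in one go.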
Here are the coarse steps involved in a 'transaction':
{code}
(0) row lock (Put, Increment, etc.)
(1) Go to WAL
(2) get new sequence id
(3) actually write WAL
(4) update memstore
(5) wait for our edit to be visible
(6) commit/move forward the read point
(7) undo rowlock
{code}
Up to now, the way we did 'ACID' was around memstore only. The readpoint is
kept inside an instance of RWCC (ReadWriteConsistencyControl). A RWCC instance is Region scoped (one
is created on creation of a HRegion). A new writepoint is created when we go
to write the memstore in step (4) above and then the readpoint is moved forward
to match the writepoint just before we do step (7) in the above. Currently our
RWCC transaction spans step (4) to (7) roughly.
"Wait to be visible" in the above means wait until all transactions that have
an id that is less than mine complete before I proceed to update the read point
and return to the client. A transaction that started before us may not complete
until after ours because of thread scheduling, hiccups, etc. We do not want to
move the read point forward until all updates previous to ours have completed
else we'll be letting clients read the incomplete earlier transactions.
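A minimal sketch of the 'wait to be visible' step, under the assumption that completion and waiting are folded into one call. `VisibilityGate`, `begin` and `completeAndWait` are hypothetical names for illustration, not the RWCC API.

```java
import java.util.TreeSet;

/**
 * Hypothetical sketch of 'wait for our edit to be visible': a writer blocks
 * until every transaction with a smaller id has also completed.
 */
class VisibilityGate {
    private long nextWritePoint = 0;
    private long readPoint = 0;
    // Completed write points that the read point has not reached yet.
    private final TreeSet<Long> completed = new TreeSet<>();

    /** Start a transaction and get its id. */
    synchronized long begin() {
        return ++nextWritePoint;
    }

    /** Mark our transaction done, then block until the read point covers it. */
    synchronized void completeAndWait(long writePoint) {
        completed.add(writePoint);
        // Advance the read point over the contiguous completed prefix.
        while (!completed.isEmpty() && completed.first() == readPoint + 1) {
            readPoint = completed.pollFirst();
        }
        notifyAll(); // wake writers whose edits may now be visible
        while (readPoint < writePoint) {
            try {
                wait(); // an earlier transaction is still running
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for visibility", e);
            }
        }
    }

    synchronized long readPoint() {
        return readPoint;
    }
}
```

The blocking loop is the cost under discussion: the time a writer spends here depends on the slowest transaction that started before it.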
Of note in the above, how long the WAL takes is not part of a RWCC transaction.
If we move to using HLog sequence numbers, the transaction now starts at step
(1) when we go to the WAL. We'll need to update the writepoint in RWCC at step
(1). The HLog sequence number is for all of the region server; it's not just
HRegion scoped. The 'wait for our edit to be visible' will now be dependent
on the completion of edits against unrelated HRegions whose character may be
completely different (e.g. the schema on HRegion A may be for increments
whereas the schema on HRegion B may be for fat batches of cells. If both are
on the same regionserver, the 'wait for our edit to be visible' may have the
increments waiting on the completion of a fat batch of updates).
So, the thought is instead to have a per region sequence number with the write
point updated only after we emerge from the WAL append. We keep the current
'transaction' scope where scope is between steps (4) and (7) in the above.
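Roughly, the per region sequence number could look like the sketch below. It assumes the largest edit number already persisted for the region is known when the region opens; `RegionSequence` and its methods are hypothetical names, not existing HBase classes.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch of a per-HRegion edit number: seeded on region open
 * from the largest number seen so far, then incremented from there.
 */
class RegionSequence {
    private final AtomicLong sequence;

    /** largestSeen: assumed to be read from the region's files on open. */
    RegionSequence(long largestSeen) {
        this.sequence = new AtomicLong(largestSeen);
    }

    /** Next edit number for this region only. */
    long next() {
        return sequence.incrementAndGet();
    }

    long current() {
        return sequence.get();
    }
}
```

Because each region holds its own counter, a gap in one region's numbering never holds up the read point of another region on the same regionserver.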
I'm going to go implement the per region edit number unless an alternative is
suggested.
> TestAcidGuarantee broken on trunk
> ----------------------------------
>
> Key: HBASE-2856
> URL: https://issues.apache.org/jira/browse/HBASE-2856
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.89.20100621
> Reporter: ryan rawson
> Assignee: stack
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: 2856-v2.txt, 2856-v3.txt, acid.txt
>
>
> TestAcidGuarantee has a test whereby it attempts to read a number of columns
> from a row, and every so often the first column of N is different, when it
> should be the same. This is a bug deep inside the scanner whereby the first
> peek() of a row is done at time T then the rest of the read is done at T+1
> after a flush, thus the memstoreTS data is lost, and previously 'uncommitted'
> data becomes committed and flushed to disk.
> One possible solution is to introduce the memstoreTS (or similarly equivalent
> value) to the HFile thus allowing us to preserve read consistency past
> flushes. Another solution involves fixing the scanners so that peek() is not
> destructive (and thus might return different things at different times alas).