[ https://issues.apache.org/jira/browse/HBASE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845723#action_12845723 ]
Yoram Kulbak commented on HBASE-2294:
-------------------------------------
[Ryan] One point of discussion, I think it's important to have a scanner stay
'up to date' as much as possible
[Todd] Also +1, I think the current compromise makes sense - don't see partial
row mutations, but at the beginning of each row, the freshest data is taken.
[Stack] + On 'XXX: Ryan has mentioned the model of "scans will always get the
most up-to-date version of a row when beginning a new row". Do we want to
guarantee this or just leave it at "some version of the row at least as new as
what existed at scan start"?', I'm fine with either. The latter seems an easier
guarantee.

I'm basing my comment on the assumption that HBASE-2248 plays a major role in
enforcing the ACID properties of HBase.

IMHO having the scanner stay 'up to date' as much as possible is a
nice-to-have, and definitely not important enough to hurt performance. A quick
look at the suggested patch for HBASE-2248 reveals that, in order to enforce
the rule above, the memstore scanner was reverted to using the 0.20.2-style
ConcurrentSkipListSet#tailSet operation. Our experiments on 0.20.2 showed that
with this style of memstore scanning it is actually three times slower to scan
the memstore than it is to scan the store files (with block cache enabled).
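
To make the trade-off concrete, here is a minimal sketch of what a
tailSet-per-step scan could look like. It is not the HBase memstore scanner:
the class name, the String keys, and the demo method are all assumptions made
for illustration. Each call to next() re-seeks from the last returned key, so
it pays for a skip-list lookup every step, but it also observes entries that
were inserted after the scan started.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical stand-in for a memstore scanner; not HBase code.
class TailSetScanSketch {
    private final ConcurrentSkipListSet<String> memstore;
    private String lastReturned; // null until the first call to next()

    TailSetScanSketch(ConcurrentSkipListSet<String> memstore) {
        this.memstore = memstore;
    }

    // Re-seeks on every call: builds a fresh view of everything strictly
    // after the last returned key and takes its first element.
    String next() {
        NavigableSet<String> rest = (lastReturned == null)
                ? memstore
                : memstore.tailSet(lastReturned, false);
        Iterator<String> it = rest.iterator();
        if (!it.hasNext()) {
            return null; // end of scan
        }
        lastReturned = it.next();
        return lastReturned;
    }

    // Scans {a, b, d} and inserts "c" once "b" has been returned,
    // simulating a write that lands while the scan is in progress.
    static List<String> demo() {
        ConcurrentSkipListSet<String> memstore =
                new ConcurrentSkipListSet<>(Arrays.asList("a", "b", "d"));
        TailSetScanSketch scanner = new TailSetScanSketch(memstore);
        List<String> seen = new ArrayList<>();
        for (String k = scanner.next(); k != null; k = scanner.next()) {
            seen.add(k);
            if (k.equals("b")) {
                memstore.add("c");
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a, b, c, d]
    }
}
```

The sketch only shows the freshness behaviour and where the per-step seek
comes from. It says nothing about how large the slow-down is in practice.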

Also, assuming that the HBASE-2248 patch is committed, I don't see any point
in a 'best effort' guarantee. From the user's perspective, "'up to date' as
much as possible" is not clearly defined, so it is better to guarantee the
clear-cut notion of seeing your own writes, which also leaves leeway for
future performance tweaks.

I haven't performance-tested any of the suggested patches for HBASE-2248, but
it seems like a PE run is going to be performed soon. My guess is that if the
PE numbers are compared to the existing baseline, they may show a slow-down.

I'm not familiar with the wide range of use cases for HBase, but in my
experience scanning through a single region usually takes less than a second.
Every time the client scanner moves to a new region a new region scanner is
instantiated (which grabs the latest 'region state') and so in most cases, the
client scanner will encounter rows which are at most a couple of seconds old.
Slower scans will usually be due to the client side performing some lengthy
operations during the scan. I would think that clients which do 'lengthy scans'
don't particularly care about performance. Hence, if they wish to make a best
effort to process up-to-date rows, they can issue a GET for every row before
they process it. For most cases, I would expect a row which is at most a
couple of seconds old to be good enough.
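
The GET-per-row suggestion can be sketched as follows. This deliberately does
not use the HBase client API: a ConcurrentSkipListMap stands in for the table,
a copy taken at scan start stands in for the scanner's view, and every name
here is hypothetical. In a real client the re-read would be a GET issued for
each row the scanner returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.BiConsumer;

// Hypothetical in-memory stand-in for a table; not the HBase client API.
class RefreshingScanSketch {

    // Iterates over the state captured at scan start. When refresh is true,
    // each row is re-read from the live table just before it is processed.
    static void scan(ConcurrentSkipListMap<String, String> table,
                     boolean refresh,
                     BiConsumer<String, String> process) {
        TreeMap<String, String> snapshot = new TreeMap<>(table);
        for (Map.Entry<String, String> e : snapshot.entrySet()) {
            String value = refresh ? table.get(e.getKey()) : e.getValue();
            if (value == null) {
                continue; // row was deleted after the scan started
            }
            process.accept(e.getKey(), value);
        }
    }

    // Two rows; row2 is overwritten while the client is still busy with
    // row1, which models a lengthy client-side operation during the scan.
    static List<String> demo(boolean refresh) {
        ConcurrentSkipListMap<String, String> table = new ConcurrentSkipListMap<>();
        table.put("row1", "v1");
        table.put("row2", "v1");
        List<String> seen = new ArrayList<>();
        scan(table, refresh, (row, value) -> {
            seen.add(row + "=" + value);
            if (row.equals("row1")) {
                table.put("row2", "v2");
            }
        });
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // prints [row1=v1, row2=v1]
        System.out.println(demo(true));  // prints [row1=v1, row2=v2]
    }
}
```

The extra read per row is the price of freshness here, which matches the
argument above that only clients doing lengthy scans would want to pay it.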
> Enumerate ACID properties of HBase in a well defined spec
> ---------------------------------------------------------
>
> Key: HBASE-2294
> URL: https://issues.apache.org/jira/browse/HBASE-2294
> Project: Hadoop HBase
> Issue Type: Task
> Components: documentation
> Reporter: Todd Lipcon
> Priority: Blocker
> Fix For: 0.20.4, 0.21.0
>
>
> It's not written down anywhere what the guarantees are for each operation in
> HBase with regard to the various ACID properties. I think the developers know
> the answers to these questions, but we need a clear spec for people building
> systems on top of HBase. Here are a few sample questions we should endeavor
> to answer:
> - For a multicell put within a CF, is the update made durable atomically?
> - For a put across CFs, is the update made durable atomically?
> - Can a read see a row that hasn't been sync()ed to the HLog?
> - What isolation do scanners have? Somewhere between snapshot isolation and
> no isolation?
> - After a client receives a "success" for a write operation, is that
> operation guaranteed to be visible to all other clients?
> etc
> I see this JIRA as having several points of discussion:
> - Evaluation of what the current state of affairs is
> - Evaluate whether we currently provide any guarantees that aren't useful to
> users of the system (perhaps we can drop in exchange for performance)
> - Evaluate whether we are missing any guarantees that would be useful to
> users of the system