[ 
https://issues.apache.org/jira/browse/HBASE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845723#action_12845723
 ] 

Yoram Kulbak commented on HBASE-2294:
-------------------------------------

[Ryan] One point of discussion, I think it's important to have a scanner stay 
'up to date' as much as possible

[Todd] Also +1, I think the current compromise makes sense - don't see partial 
row mutations, but at the beginning of each row, the freshest data is taken.

[Stack] + On 'XXX: Ryan has mentioned the model of "scans will always get the 
most up-to-date version of a row when beginning a new row". Do we want to 
guarantee this or just leave it at "some version of the row at least as new as 
what existed at scan start"?', I'm fine with either. The latter seems an easier 
guarantee.

My comment assumes that HBASE-2248 plays a major role in enforcing the 
ACID properties of HBase. 

IMHO having the scanner stay 'up to date' as much as possible is a 
nice-to-have, and definitely not important enough to hurt performance. A quick 
look at the suggested patch for HBASE-2248 reveals that, in order to enforce 
the rule above, the memstore scanner was reverted to using the 0.20.2-style 
ConcurrentSkipListSet#tailSet operation. Our experiments on 0.20.2 showed that 
with this style of memstore scanning it's actually 3 times slower to scan the 
memstore than it is to scan the store files (with block cache enabled). 
Also (assuming that the HBASE-2248 patch is committed), I don't see any point 
in a 'best effort' guarantee. From the user's perspective, "'up to date' as 
much as possible" is not clearly defined, so it's better to guarantee the 
clear-cut notion of seeing your own writes, which also leaves leeway for 
future performance tweaks.
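
To make the trade-off concrete, here is a minimal sketch of what a 
tailSet-style scan looks like on a plain java.util.concurrent 
ConcurrentSkipListSet. This is not the HBase memstore code or the HBASE-2248 
patch; the class name TailSetScanSketch and the method scanByReseeking are 
made up for illustration. The scan re-seeks past the last returned key on 
every step, so it pays a skip-list lookup per key but observes entries that 
were inserted ahead of its position after the scan started.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.function.Consumer;

// Hypothetical illustration only; not taken from the HBase code base.
class TailSetScanSketch {

    // Scans the set by re-seeking with tailSet() after every key.
    // afterEachKey runs once per returned key and stands in for a
    // concurrent writer that may insert while the scan is in progress.
    static List<String> scanByReseeking(ConcurrentSkipListSet<String> keys,
                                        Consumer<String> afterEachKey) {
        List<String> seen = new ArrayList<>();
        String last = null;
        while (true) {
            // A fresh view strictly after the last key: one lookup per step.
            NavigableSet<String> rest =
                (last == null) ? keys : keys.tailSet(last, false);
            Iterator<String> it = rest.iterator();
            if (!it.hasNext()) {
                break;
            }
            last = it.next();
            seen.add(last);
            afterEachKey.accept(last);
        }
        return seen;
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<String> keys = new ConcurrentSkipListSet<>();
        keys.add("a");
        keys.add("c");
        // Insert "b" right after "a" has been returned by the scan.
        List<String> seen = scanByReseeking(keys, k -> {
            if (k.equals("a")) {
                keys.add("b");
            }
        });
        System.out.println(seen);
    }
}
```

A scan that holds one iterator for its whole lifetime avoids the repeated 
lookups, but ConcurrentSkipListSet iterators are only weakly consistent, so 
whether such a scan sees the late insert is not guaranteed.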

I haven't performance tested any of the suggested patches for HBASE-2248, but 
it seems that a PE (PerformanceEvaluation) run is going to be done soon. My 
guess is that when the PE numbers are compared to the existing baseline, they 
may show a slow-down.
  
I'm not familiar with the wide range of use-cases for HBase, but in my 
experience scanning through a single region usually takes less than a second. 
Every time the client scanner moves to a new region, a new region scanner is 
instantiated (which grabs the latest 'region state'), so in most cases the 
client scanner will encounter rows which are at most a couple of seconds old. 
Slower scans will usually be due to the client side performing some lengthy 
operations during the scan. I would think that clients which do 'lengthy scans' 
don't particularly care about performance, and hence, if they wish to make a 
best effort to process up-to-date rows, they can issue a GET for every row 
before they process it. For most cases, I would expect a row which is at most a 
couple of seconds old to be good enough.
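
The 'GET before processing' idea can be sketched roughly as follows. This 
does not use the HBase client API: the Table class below is a hypothetical 
in-memory stand-in, where openScanner() copies the table state at scan start 
and get() reads the current value of one row. All names here are assumptions 
made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical illustration only; not the HBase client API.
class RefreshBeforeProcessSketch {

    // In-memory stand-in for a table: row key -> latest value.
    static final class Table {
        final ConcurrentSkipListMap<String, String> rows =
            new ConcurrentSkipListMap<>();

        // Stand-in for opening a scanner: capture the state at scan start.
        TreeMap<String, String> openScanner() {
            return new TreeMap<>(rows);
        }

        // Stand-in for a GET: read the current value of a single row.
        String get(String key) {
            return rows.get(key);
        }
    }

    // Walks the scan result but re-reads each row just before processing it.
    // afterEachRow stands in for lengthy client-side work between rows.
    static List<String> processFresh(Table table,
                                     TreeMap<String, String> scan,
                                     Runnable afterEachRow) {
        List<String> processed = new ArrayList<>();
        for (Map.Entry<String, String> e : scan.entrySet()) {
            String current = table.get(e.getKey());
            if (current != null) { // skip rows deleted since scan start
                processed.add(e.getKey() + "=" + current);
            }
            afterEachRow.run();
        }
        return processed;
    }

    public static void main(String[] args) {
        Table table = new Table();
        table.rows.put("r1", "v1");
        table.rows.put("r2", "v1");
        TreeMap<String, String> scan = table.openScanner();
        // While the client is busy, another writer updates r2.
        List<String> out =
            processFresh(table, scan, () -> table.rows.put("r2", "v2"));
        System.out.println(out);
    }
}
```

The extra read per row is the cost such a client would accept in exchange 
for fresher data; clients that are happy with rows a few seconds old would 
simply process the scan result directly.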


> Enumerate ACID properties of HBase in a well defined spec
> ---------------------------------------------------------
>
>                 Key: HBASE-2294
>                 URL: https://issues.apache.org/jira/browse/HBASE-2294
>             Project: Hadoop HBase
>          Issue Type: Task
>          Components: documentation
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.20.4, 0.21.0
>
>
> It's not written down anywhere what the guarantees are for each operation in 
> HBase with regard to the various ACID properties. I think the developers know 
> the answers to these questions, but we need a clear spec for people building 
> systems on top of HBase. Here are a few sample questions we should endeavor 
> to answer:
> - For a multicell put within a CF, is the update made durable atomically?
> - For a put across CFs, is the update made durable atomically?
> - Can a read see a row that hasn't been sync()ed to the HLog?
> - What isolation do scanners have? Somewhere between snapshot isolation and 
> no isolation?
> - After a client receives a "success" for a write operation, is that 
> operation guaranteed to be visible to all other clients?
> etc
> I see this JIRA as having several points of discussion:
> - Evaluation of what the current state of affairs is
> - Evaluate whether we currently provide any guarantees that aren't useful to 
> users of the system (perhaps we can drop in exchange for performance)
> - Evaluate whether we are missing any guarantees that would be useful to 
> users of the system

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
