[ 
https://issues.apache.org/jira/browse/HBASE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846784#action_12846784
 ] 

Todd Lipcon commented on HBASE-2294:
------------------------------------

bq. Clocks, anecdotally, do progress at different rates

Certainly within a bounded amount of error for practical systems, though - we 
could set this error as high as 50%, and on the time scales we're talking about 
I don't think one node will think 5 seconds passed while another thinks 10.

bq. Also you would have to ensure that you read and test the clock atomically 
with the update,

I don't think so, in this case. Let me try to work through this somewhat [maybe 
overly] formally (mostly to convince myself too!)

We have the following events:

1) node A reads its timestamp as T1
2) node A sends a sync() message to ZK
2a) ZK receives sync() message and responds
3) node A receives success from sync (ie things have been sunk)
3a) concurrently at some point, node A loses its connection to ZK (network 
partition or some such)
4) client C sends a request to node A (call this T2)
5) node A receives a request from client C (call this T3)
6) node A responds to C
7) C receives response
8) ZK times out session to A (call this T4)
[note that this sequence above isn't a defined ordering]

Looking at "happens-before" relations, we know the following easily:
- 1 < 2 < 3 < 5 < 6  (these are all seen by A in this order, so we know it to 
be true)
- 3 < 3a (the connection must have been up when we received success)
- 1 < 2a < 3 (causal)
- 4 < 5 < 6 < 7 (causal)
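
As a rough illustration (not part of any HBase or ZooKeeper code), the 
relations above can be written down as edges and closed transitively. The 
step labels are the ones from the list; the names EDGES, HB and ordered are 
made up for this sketch. It shows that these relations alone say nothing 
about where step 8 falls relative to steps 4 or 5, which is why the clock 
argument is needed.

```python
from itertools import product

# Direct happens-before edges, copied from the relations listed above.
EDGES = {
    ("1", "2"), ("2", "3"), ("3", "5"), ("5", "6"),  # seen in this order by A
    ("3", "3a"),                                       # connection was up at 3
    ("1", "2a"), ("2a", "3"),                          # causal, via ZK
    ("4", "5"), ("6", "7"),                            # causal, via client C
}


def transitive_closure(edges):
    """Return every (x, y) pair such that x happens-before y."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow safely inside the loop.
        for (a, b), (c, d) in product(tuple(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


HB = transitive_closure(EDGES)


def ordered(x, y):
    """True if the relations force an order between steps x and y."""
    return (x, y) in HB or (y, x) in HB


if __name__ == "__main__":
    print("1 before 7:", ("1", "7") in HB)
    print("5 ordered against 8:", ordered("5", "8"))
    print("4 ordered against 8:", ordered("4", "8"))
```

Step 8 appears in none of the listed relations, so no ordering against it can 
be derived without reasoning about timeouts and clocks.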

Let's say that ZK will time out a node it hasn't heard from in Z seconds. From 
ZK's perspective, then, step 8 occurs at least Z seconds after step 2a. Since 
step 1 happens before 2a (see above), we know that step 8 happens at least Z 
seconds after step 1. If we assume that ZK's clock progresses at some error 
ratio of A's clock, then on A's clock step 8 happens at least Z*errorRatio 
seconds after ZK received the sync. ZK received the sync (2a) some unknown 
amount of time after T1 due to latency. So T4 from A's perspective >= T1 + 
Z*errorRatio + latency. That is, as long as we are within Z*errorRatio seconds 
after _sending_ our last ZK message, we are "in the clear" that no one else has 
decided we're dead.

Back to the problem at hand, to avoid "time travel" reads, what we need to do 
is make sure that when we _initiate_ the read from a client, the target region 
server is still holding the region (i.e. 4 happens before 8). We already know 4 
happens before 5, so if 5 happens before 8, that's a stronger condition. We 
know step 5 happens before 8 if T3 < T4. We decided T4 >= T1 + Z*errorRatio + 
latency. So if T3 < T1 + Z*errorRatio + latency, we are good to go. We don't 
know latency, but it's always positive so it only helps us: checking T3 < T1 + 
Z*errorRatio on node A is sufficient.

Does this sound correct?

> Enumerate ACID properties of HBase in a well defined spec
> ---------------------------------------------------------
>
>                 Key: HBASE-2294
>                 URL: https://issues.apache.org/jira/browse/HBASE-2294
>             Project: Hadoop HBase
>          Issue Type: Task
>          Components: documentation
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.20.4, 0.21.0
>
>
> It's not written down anywhere what the guarantees are for each operation in 
> HBase with regard to the various ACID properties. I think the developers know 
> the answers to these questions, but we need a clear spec for people building 
> systems on top of HBase. Here are a few sample questions we should endeavor 
> to answer:
> - For a multicell put within a CF, is the update made durable atomically?
> - For a put across CFs, is the update made durable atomically?
> - Can a read see a row that hasn't been sync()ed to the HLog?
> - What isolation do scanners have? Somewhere between snapshot isolation and 
> no isolation?
> - After a client receives a "success" for a write operation, is that 
> operation guaranteed to be visible to all other clients?
> etc
> I see this JIRA as having several points of discussion:
> - Evaluation of what the current state of affairs is
> - Evaluate whether we currently provide any guarantees that aren't useful to 
> users of the system (perhaps we can drop in exchange for performance)
> - Evaluate whether we are missing any guarantees that would be useful to 
> users of the system

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
