[ 
https://issues.apache.org/jira/browse/HBASE-14054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623531#comment-14623531
 ] 

Tobi Vollebregt commented on HBASE-14054:
-----------------------------------------

Thinking about it more, I think the most intuitive behavior would be for HBase 
to use {{max(last timestamp written for the row, System.currentTimeMillis())}} 
for all writes, unless the client explicitly specifies the timestamp. That 
seems to be the only way to guarantee monotonic writes for regular Puts too, 
not just checkAndPut, i.e. if you

- put X to row A
- put Y to row A

I expect that it is guaranteed that the write of Y "wins".

Currently, writes are only monotonic if the clock on the region server is 
monotonic. If it isn't, then even if these two writes are done sequentially 
from the same thread, there's a small chance that the first write wins, which 
seems counterintuitive.
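
To make the proposed rule concrete, here is a rough Java sketch. It is purely 
illustrative and not HBase code: the class {{MonotonicTimestampSketch}}, its 
methods, and the injected clock are all made-up names, and a real 
implementation would have to live in the region server write path.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch, not HBase code: hands out server-side timestamps that
// never go backwards for a given row, even if the wall clock does.
class MonotonicTimestampSketch {
    // Last timestamp written per row key (illustrative bookkeeping only).
    private final ConcurrentHashMap<String, Long> lastWritten = new ConcurrentHashMap<>();
    // Injected clock so the sketch can be exercised with a clock that jumps back;
    // in production this would be System::currentTimeMillis.
    private final LongSupplier clock;

    MonotonicTimestampSketch(LongSupplier clock) {
        this.clock = clock;
    }

    // Timestamp for a write where the client did not specify one:
    // max(last timestamp written for the row, current clock reading).
    long assign(String row) {
        long now = clock.getAsLong();
        // merge() stores and returns max(previous, now) atomically per key.
        return lastWritten.merge(row, now, Long::max);
    }

    // A client-specified timestamp is used as given, but still remembered
    // as the last timestamp written for the row.
    long recordClientTimestamp(String row, long clientTs) {
        lastWritten.put(row, clientTs);
        return clientTs;
    }
}
```

With a clock that reads 11 and then 10, two consecutive calls to 
{{assign("A")}} both return 11 instead of 11 followed by 10, so the second 
write is never stamped earlier than the first. Note that the two writes can 
end up with equal timestamps; the sketch does not address how such ties are 
resolved.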

I'm intentionally saying {{max(last timestamp written for the row, ...)}}, 
because I imagine that checking *all* timestamps for a row may be prohibitively 
expensive. Checking only the last one written is sufficient to guarantee 
monotonic writes, as long as the client consistently either specifies 
timestamps or leaves them unspecified.

If the client isn't consistent, there may still be invisible writes:

- put X to row A at timestamp far in the future, specified by client
- put Y to row A at current time, specified by client
- put Z to row A at current time assigned by HBase

Writes Y and Z will be eclipsed by write X. I think that is acceptable given 
the API and the warnings in the documentation about assigning timestamps 
manually.
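
The X/Y/Z scenario can be replayed with a toy model of a single cell. This is 
only a sketch under simplifying assumptions, not HBase code: the class 
{{EclipsedWriteDemo}} and its methods are hypothetical, versions are kept in a 
map keyed by timestamp, and a read simply returns the value with the highest 
timestamp.

```java
import java.util.TreeMap;

// Toy model of one cell, not HBase code: versions are keyed by timestamp
// and a read returns the value with the highest timestamp.
class EclipsedWriteDemo {
    private final TreeMap<Long, String> versions = new TreeMap<>();
    // Only the last timestamp written is tracked, not the maximum over all versions.
    private long lastWritten = Long.MIN_VALUE;

    // Client-specified timestamp: stored exactly as given.
    void put(String value, long clientTs) {
        versions.put(clientTs, value);
        lastWritten = clientTs;
    }

    // Server-assigned timestamp: max(last timestamp written, now).
    long putAssigned(String value, long now) {
        long ts = Math.max(lastWritten, now);
        versions.put(ts, value);
        lastWritten = ts;
        return ts;
    }

    // Value with the highest timestamp; assumes at least one write happened.
    String get() {
        return versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        EclipsedWriteDemo cell = new EclipsedWriteDemo();
        cell.put("X", 1_000_000L);   // far in the future, specified by client
        cell.put("Y", 100L);         // current time, specified by client
        cell.putAssigned("Z", 101L); // current time, assigned by the server
        System.out.println(cell.get()); // prints "X"
    }
}
```

Because only the last written timestamp (Y's) is consulted, Z is stamped with 
the current time and stays below X, so a read still returns X.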

> Acknowledged writes may get lost if regionserver clock is set backwards
> -----------------------------------------------------------------------
>
>                 Key: HBASE-14054
>                 URL: https://issues.apache.org/jira/browse/HBASE-14054
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.98.6
>         Environment: Linux
>            Reporter: Tobi Vollebregt
>
> We experienced a small number of lost acknowledged writes in production on 
> July 1st (~700 identified so far).
> What happened was that we had NTP turned off since June 29th to prevent 
> issues due to the leap second on June 30th. NTP was turned back on July 1st.
> The next day, we noticed we were missing writes to a few of our higher 
> throughput aggregation tables.
> We found that this is caused by HBase taking the current time using 
> System.currentTimeMillis, which may be set backwards by NTP, and using this 
> without any checks to populate the timestamp of rows for which the client 
> didn't supply a timestamp.
> Our application uses a read-modify-write pattern using get+checkAndPut to 
> perform aggregation as follows:
> 1. read version 1
> 2. mutate
> 3. write version 2
> 4. read version 2
> 5. mutate
> 6. write version 3
> The application retries the full read-modify-write if the checkAndPut fails.
> What must have happened on July 1st, after we started NTP back up, was this 
> (timestamps added):
> 1. read version 1 (timestamp 10)
> 2. mutate
> 3. write version 2 (HBase-assigned timestamp 11)
> 4. read version 2 (timestamp 11)
> 5. mutate
> 6. write version 3 (HBase-assigned timestamp 10)
> Hence, the last write was eclipsed by the first write, and hence, an 
> acknowledged write was lost.
> While this seems to match documented behavior (paraphrasing: "if timestamp is 
> not specified HBase will assign a timestamp using System.currentTimeMillis" 
> "the row with the highest timestamp will be returned by get"), I think it is 
> very unintuitive and needs at least a big warning in the documentation, along 
> the lines of "Acknowledged writes may not be visible unless the timestamp is 
> explicitly specified and equal to or larger than the highest timestamp for 
> that row".
> I would also like to use this ticket to start a discussion on if we can make 
> the behavior better:
> Could HBase assign a timestamp of {{max(max timestamp for the row, 
> System.currentTimeMillis())}} in the checkAndPut write path, instead of 
> blindly taking {{System.currentTimeMillis()}}, similar to what has been done 
> in HBASE-12449 for increment and append?
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
