[ 
https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Appy updated HBASE-18432:
-------------------------
    Attachment: HBASE-18432.HBASE-14070.HLC.002.patch

> Prevent clock from getting stuck after update()
> -----------------------------------------------
>
>                 Key: HBASE-18432
>                 URL: https://issues.apache.org/jira/browse/HBASE-18432
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Appy
>            Assignee: Appy
>         Attachments: HBASE-18432.HBASE-14070.HLC.001.patch, 
> HBASE-18432.HBASE-14070.HLC.002.patch
>
>
> There were a [bunch of 
> problems|https://issues.apache.org/jira/browse/HBASE-14070?focusedCommentId=16094013&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16094013]
> (also copied below) with the clock getting stuck after a call to update()
> until its own system time caught up.
> ----
> PT = physical time, LT = logical time, ST = system time, X = don't care terms
> ----
> Core issue:
> - Note that in the current implementation, we pass the master clock to the RS
> in open/close region requests and the RS clock to the master in the
> responses. Both update their own time on receiving these requests/responses.
> - On receiving a clock ahead of its own, each updates its own clock to the
> received PT+LT, and keeps increasing LT till its own ST catches up to that PT.
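The behaviour in the two bullets above can be modelled in a few lines. This is only a sketch under stated assumptions: the class `StuckClockSketch`, its fields, and the injected `systemTime` are hypothetical names for illustration, not the HBase clock implementation.

```java
// Hypothetical model of the update() behaviour described above; not HBase code.
final class StuckClockSketch {
    long systemTime;    // ST: injected by the caller so the example is deterministic
    long physicalTime;  // PT component of the HLC
    long logicalTime;   // LT component of the HLC

    StuckClockSketch(long startTime) {
        systemTime = startTime;
        physicalTime = startTime;
    }

    // Timestamp a local event: PT follows ST only once ST has passed PT.
    long[] now() {
        if (systemTime > physicalTime) {
            physicalTime = systemTime;
            logicalTime = 0;
        } else {
            logicalTime++;  // PT is stuck, so LT has to absorb the event
        }
        return new long[] {physicalTime, logicalTime};
    }

    // Receive a remote clock: adopt its PT+LT when it is ahead of ours.
    void update(long remotePt, long remoteLt) {
        if (remotePt > physicalTime
                || (remotePt == physicalTime && remoteLt > logicalTime)) {
            physicalTime = remotePt;
            logicalTime = remoteLt;
        }
    }
}
```

With an RS at ST 10 receiving (20, 5), every now() call returns PT 20 with a growing LT until ST passes 20.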
> ----
> Proposed solution:
> Keep track of the clock's skew. Instead of storing physical time, always
> compute it by adding system time and skew.
> On update(), recalculate skew and check that it isn't greater than max_skew.
> On toTimestamp(), calculate PT = ST+skew.
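A minimal sketch of the proposal, under assumptions of my own: `SkewClockSketch`, `physicalTime()`, and the injected `systemTime` are hypothetical names, the logical component is left out, and the choice to only ever raise the skew (so the clock never moves backwards) is an assumption rather than something the description states.

```java
// Hypothetical sketch of the skew-based proposal; not the attached patch.
final class SkewClockSketch {
    final long maxSkew;  // max_skew from the description
    long systemTime;     // ST: injected so the example is deterministic
    long skew;           // how far ahead of ST this clock currently runs

    SkewClockSketch(long startTime, long maxSkew) {
        this.systemTime = startTime;
        this.maxSkew = maxSkew;
    }

    // toTimestamp() in the description: PT is always derived, never stored.
    long physicalTime() {
        return systemTime + skew;
    }

    // update(): recalculate skew from the remote PT and validate it.
    void update(long remotePt) {
        long newSkew = remotePt - systemTime;
        if (newSkew <= skew) {
            return;  // remote clock is not ahead of ours (assumed behaviour)
        }
        if (newSkew > maxSkew) {
            throw new IllegalStateException(
                "skew " + newSkew + " exceeds max_skew " + maxSkew);
        }
        skew = newSkew;
    }
}
```

Because PT is recomputed from ST on every read, it keeps advancing after update() instead of waiting for ST to catch up.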
> -----
> -----
> Issues with current approach:
> ----
> Problem 1: Logical time window too small.
> RS clock (10, X)
> Master clock (20, X)
> Master --request-> RS
> RS clock (20, X)
> The RS's physical java clock (which backs the physical component of the hlc
> clock) will still take 10 sec to catch up, and meanwhile we'll keep
> incrementing the logical component. That means, in the worst case, our
> logical clock window should be big enough to support all the events that can
> happen in max skew time.
> The problem is, that doesn't seem to be the case. Our logical window is 1M
> events (20 bits) and max skew time is 30 sec. That gives about 33k max write
> qps, which is quite low. We can easily see 150k update qps per beefy server
> with 1k values.
> Even 22 bits won't be enough. We'll need a minimum of 23 bits and 20 sec max
> skew time to support ~420k max events per second in worst case clock skew.
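The arithmetic above can be checked directly. The helper below is a hypothetical illustration (`LogicalWindow` and `maxEventsPerSecond` are made-up names). It uses the exact window size 2^bits, so 20 bits over 30 sec comes out at 34,952 rather than the 33k obtained from the rounded 1M figure.

```java
// Hypothetical helper for the worst-case throughput arithmetic above.
final class LogicalWindow {
    // Events per second the logical window can absorb if PT stays stuck
    // for the whole max skew time.
    static long maxEventsPerSecond(int logicalBits, long maxSkewSeconds) {
        return (1L << logicalBits) / maxSkewSeconds;
    }

    public static void main(String[] args) {
        System.out.println(maxEventsPerSecond(20, 30));  // 34952
        System.out.println(maxEventsPerSecond(22, 30));  // 139810, below 150k
        System.out.println(maxEventsPerSecond(23, 20));  // 419430, the ~420k case
    }
}
```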
> ----
> Problem 2: Cascading logical time increment.
> Consider more RSs, say 3 RSs and 1 master, with a max skew of 30 sec.
> HLC Clocks (physical time, logical time): X = don't care
> RS1: (50, 100k)
> Master: (40, X)
> RS2: (30, X)
> RS3: (20, X) 
> [RS3's ST is behind RS1's by 30 sec.]
> RS1 replies to master and sends its clock (50, X).
> Master's clock (50, X). It'll be another 10 sec before its own physical
> clock reaches 50, so HLC's PT will remain 50 for the next 10 sec.
> Master --> RS2
> RS2's clock = (50, X).
> RS2 keeps incrementing LT on writes (since its own PT is behind) for a few
> seconds before it replies back to master with (50, X + few 100k).
> Master's clock = (50, X + few 100k) [Since master's physical clock hasn't
> caught up yet (note that it was 10 seconds behind), PT remains 50.]
> Master --> RS3
> RS3's clock (50, X+few 100k) 
> But RS3's ST is behind RS1's ST by 30 sec, which means it'll keep
> incrementing LT for the next 30 sec (unless it gets a newer clock from
> master).
> The problem is, RS3 now has a much smaller LT window left than the full 1M!
> ----
> Problem 3: Single bad RS clock crashing the cluster:
> If a single RS's clock is bad and runs a bit fast, it'll keep pulling
> master's PT forward with it. If 'real time' is say 20, max skew time is 10,
> and the bad RS is at time 29.9, it'll pull master to 29.9 (via the next
> response), and then any RS below 19.9, i.e. just 0.1 sec behind real time,
> will die due to higher than max skew.
> This can bring whole clusters down!
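The numbers in Problem 3 can be replayed with a tiny check. This is a hypothetical sketch: `MaxSkewCheck` and `exceedsMaxSkew` are illustrative names, times are in milliseconds to avoid floating-point rounding, and the strict "greater than" comparison is an assumption about how the max skew check behaves.

```java
// Hypothetical replay of the Problem 3 numbers; not HBase's skew check.
final class MaxSkewCheck {
    // True when the received PT is further from local ST than max skew allows.
    static boolean exceedsMaxSkew(long remotePtMillis, long localStMillis,
                                  long maxSkewMillis) {
        return Math.abs(remotePtMillis - localStMillis) > maxSkewMillis;
    }

    public static void main(String[] args) {
        long maxSkew = 10_000;     // max skew time: 10 sec
        long realTime = 20_000;    // 'real time': 20
        long pulledMaster = 29_900; // master after the bad RS pulled it to 29.9
        long healthyRs = 19_800;   // an RS only 0.2 sec behind real time

        System.out.println(exceedsMaxSkew(realTime, healthyRs, maxSkew));     // false
        System.out.println(exceedsMaxSkew(pulledMaster, healthyRs, maxSkew)); // true
    }
}
```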
> ----
> Problem 4: Time jumps (not a bug, but more of a nuisance)
> Say an RS is behind master by 20 sec. On each communication from master, the
> RS will update its own PT to master's PT, and it'll stay there till the RS's
> ST catches up. If communication from master is frequent, ST might never
> catch up and the RS's PT will actually look like discrete time jumps rather
> than continuous time.
> E.g. if master communicated with the RS at times 30, 40, 50 (the RS's
> corresponding times are 10, 20, 30), then all events on the RS between times
> [10, 50] will be timestamped with either 30, 40 or 50.
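The example above can be simulated by stamping one event per second on the RS. This is a sketch under assumptions: `TimeJumpSketch` and both method names are hypothetical, only the physical component is tracked, and the second method models the proposed ST+skew computation for comparison.

```java
// Hypothetical simulation of the Problem 4 example; not HBase code.
final class TimeJumpSketch {
    // Distinct PTs seen on the RS when PT is stored and waits for ST.
    static java.util.TreeSet<Long> stuckClockTimestamps() {
        java.util.TreeSet<Long> seen = new java.util.TreeSet<>();
        long pt = 10;
        for (long st = 10; st <= 50; st++) {
            if (st == 10 || st == 20 || st == 30) {
                pt = Math.max(pt, st + 20);  // update() from master, 20 sec ahead
            }
            pt = Math.max(pt, st);  // PT follows ST only once ST has caught up
            seen.add(pt);
        }
        return seen;
    }

    // Distinct PTs seen on the RS when PT is computed as ST + skew.
    static java.util.TreeSet<Long> skewClockTimestamps() {
        java.util.TreeSet<Long> seen = new java.util.TreeSet<>();
        long skew = 0;
        for (long st = 10; st <= 50; st++) {
            if (st == 10 || st == 20 || st == 30) {
                skew = Math.max(skew, (st + 20) - st);  // recalculated on update()
            }
            seen.add(st + skew);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(stuckClockTimestamps());       // [30, 40, 50]
        System.out.println(skewClockTimestamps().size()); // 41
    }
}
```

Under these assumptions the stored-PT clock produces only three distinct timestamps over the 41 seconds, while the skew-based clock produces one per second.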
> ----



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
