[ 
https://issues.apache.org/jira/browse/KUDU-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2906.
---------------------------------
    Fix Version/s: 1.17.0
       Resolution: Implemented

A generic provision to detect system/local clock jumps has been implemented 
in 
[555854178b9b498701619f4bb0dbbbbeab8e69e7|https://github.com/apache/kudu/commit/555854178b9b498701619f4bb0dbbbbeab8e69e7].
  By default, it is enabled only on Azure VM instances, but it can be turned 
on anywhere: see the commit description for details.
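
For readers unfamiliar with the idea, one common way to detect a wall-clock 
jump is to sample the wall clock and a monotonic clock together and flag any 
interval in which the two advance by noticeably different amounts. The sketch 
below is illustrative only and is not taken from the commit above: the class 
name ClockJumpDetector, its parameters, and the threshold value are all 
hypothetical.

```python
import time


class ClockJumpDetector:
    """Flags a wall-clock jump when wall time and monotonic time disagree.

    Hypothetical illustration; not the logic used by Kudu.
    """

    def __init__(self, threshold_sec, wall=time.time, mono=time.monotonic):
        self.threshold_sec = threshold_sec
        self._wall = wall
        self._mono = mono
        self._last_wall = wall()
        self._last_mono = mono()

    def check(self):
        """Return the signed jump in seconds if it exceeds the threshold, else 0.0."""
        now_wall, now_mono = self._wall(), self._mono()
        # Between two samples both clocks should advance by about the same
        # amount; the difference is how far the wall clock was stepped.
        jump = (now_wall - self._last_wall) - (now_mono - self._last_mono)
        self._last_wall, self._last_mono = now_wall, now_mono
        return jump if abs(jump) > self.threshold_sec else 0.0
```

A caller would construct the detector once, for example 
ClockJumpDetector(threshold_sec=10.0), and call check() periodically; a 
non-zero result means the wall clock moved by that many seconds relative to 
the monotonic clock since the previous call.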

With that, I'm resolving this JIRA item.

> Don't allow elections when server clocks are too out of sync
> ------------------------------------------------------------
>
>                 Key: KUDU-2906
>                 URL: https://issues.apache.org/jira/browse/KUDU-2906
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.10.0
>            Reporter: Andrew Wong
>            Priority: Major
>             Fix For: 1.17.0
>
>
> In cases where machine clocks are not properly synchronized, if a tablet 
> replica is elected leader whose clock happens to be very far in the future 
> (greater than --max_clock_sync_error_usec=10 sec), it's possible that any 
> writes that go to that tablet will be rejected by the followers, but 
> persisted to the leader's WAL.
> Then, upon fixing the clock on that machine, the replica may try to replay 
> the future op, but fail to replay it because the op timestamp is too far in 
> the future, with errors like:
> {code:java}
> F0715 12:03:09.369819  3500 tablet_bootstrap.cc:904] Check failed: _s.ok() 
> Bad status: Invalid argument: Tried to update clock beyond the max. 
> error.{code}
> Dumping a recovery WAL, I could see:
> {code:java}
> 130.138@6400743143334211584 REPLICATE NO_OP
> id { term: 130 index: 138 } timestamp: 6400743143334211584 op_type: NO_OP 
> noop_request { }
> COMMIT 130.138
> op_type: NO_OP commited_op_id { term: 130 index: 138 }
> 131.139@6400743925559676928 REPLICATE NO_OP
> id { term: 131 index: 139 } timestamp: 6400743925559676928 op_type: NO_OP 
> noop_request { }
> COMMIT 131.139
> op_type: NO_OP commited_op_id { term: 131 index: 139 }
> 132.140@11589864471731939930 REPLICATE NO_OP
> id { term: 132 index: 140 } timestamp: 11589864471731939930 op_type: NO_OP 
> noop_request { }{code}
> Note the drastic jump in timestamp.
> In this specific case, we verified that the replayed WAL wasn't that far 
> behind the recovery WAL, which had the future timestamps, so we could just 
> delete the recovery WAL and bootstrap from the replayed WAL.
> It would have been nice had those bad ops not been written at all, maybe by 
> preventing an election between such mismatched servers in the first place.
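
The size of the timestamp jump in the WAL dump above can be estimated by 
decoding the values. The sketch below assumes the hybrid-time layout in which 
the physical time in microseconds since the Unix epoch is shifted left by 12 
bits and the low 12 bits hold a logical counter; that layout, the helper 
names, and the simplified too_far_ahead() check are assumptions made for 
illustration, not code from Kudu. Only the 10 sec limit and the three 
timestamps come from the issue text.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 12  # assumption: low 12 bits hold the logical counter
MAX_CLOCK_SYNC_ERROR_USEC = 10_000_000  # the 10 sec limit quoted in the issue
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode(ts):
    """Split a hybrid timestamp into (physical microseconds, logical counter)."""
    return ts >> LOGICAL_BITS, ts & ((1 << LOGICAL_BITS) - 1)


def to_utc(ts):
    """Convert the physical part of a hybrid timestamp to a UTC datetime."""
    physical_usec, _ = decode(ts)
    return EPOCH + timedelta(microseconds=physical_usec)


def too_far_ahead(ts, local_ts):
    """Simplified stand-in for the check that rejects a far-future timestamp."""
    return decode(ts)[0] - decode(local_ts)[0] > MAX_CLOCK_SYNC_ERROR_USEC


# Op ids and timestamps copied from the recovery WAL dump in the issue.
wal_timestamps = {
    "130.138": 6400743143334211584,
    "131.139": 6400743925559676928,
    "132.140": 11589864471731939930,
}

for op_id, ts in wal_timestamps.items():
    print(op_id, to_utc(ts).isoformat())
```

Under that assumed layout, ops 130.138 and 131.139 decode to July 2019, while 
op 132.140 decodes to a date in 2059, so the last op is decades beyond the 
10 sec limit rather than marginally over it.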



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
