Andrew Wong created KUDU-2906:
---------------------------------
Summary: Don't allow elections when server clocks are too out of sync
Key: KUDU-2906
URL: https://issues.apache.org/jira/browse/KUDU-2906
Project: Kudu
Issue Type: Bug
Components: consensus
Affects Versions: 1.10.0
Reporter: Andrew Wong
In cases where machine clocks are not properly synchronized, a tablet replica
whose clock happens to be very far in the future (ahead by more than
--max_clock_sync_error_usec=10 sec) may be elected leader. If that happens, it's
possible that any writes that go to that tablet will be rejected by the
followers, but persisted to the leader's WAL.
Then, upon fixing the clock on that machine, the replica may try to replay the
future op, but fail to replay it because the op timestamp is too far in the
future, with errors like:
{code:java}
F0715 12:03:09.369819 3500 tablet_bootstrap.cc:904] Check failed: _s.ok() Bad
status: Invalid argument: Tried to update clock beyond the max. error.{code}
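The check behind this error can be pictured roughly as follows. This is a simplified sketch and not Kudu's actual code: the function name check_clock_update and its arguments are hypothetical, and only the 10-second bound is taken from the --max_clock_sync_error_usec value quoted above.

```python
# 10 sec, matching the --max_clock_sync_error_usec value quoted in this report.
MAX_CLOCK_SYNC_ERROR_USEC = 10_000_000


def check_clock_update(local_now_usec: int, incoming_usec: int,
                       max_error_usec: int = MAX_CLOCK_SYNC_ERROR_USEC) -> None:
    """Hypothetical stand-in for the clock-update check.

    Raises ValueError if the incoming timestamp is further ahead of the
    local clock than the allowed maximum error; otherwise does nothing.
    """
    if incoming_usec > local_now_usec + max_error_usec:
        raise ValueError("Tried to update clock beyond the max. error.")
```

With a corrected local clock, an op stamped decades ahead would always fail a check of this shape, which matches the crash seen during bootstrap.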
Dumping a recovery WAL, I could see:
{code:java}
130.138@6400743143334211584 REPLICATE NO_OP
id { term: 130 index: 138 } timestamp: 6400743143334211584 op_type: NO_OP
noop_request { }
COMMIT 130.138
op_type: NO_OP commited_op_id { term: 130 index: 138 }
131.139@6400743925559676928 REPLICATE NO_OP
id { term: 131 index: 139 } timestamp: 6400743925559676928 op_type: NO_OP
noop_request { }
COMMIT 131.139
op_type: NO_OP commited_op_id { term: 131 index: 139 }
132.140@11589864471731939930 REPLICATE NO_OP
id { term: 132 index: 140 } timestamp: 11589864471731939930 op_type: NO_OP
noop_request { }{code}
Note the drastic jump in timestamp.
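To put numbers on that jump, the timestamps can be decoded. The sketch below assumes Kudu's hybrid timestamp layout of physical microseconds since the Unix epoch in the high bits and a 12-bit logical counter in the low bits. That layout is not stated in this report, and the helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the low 12 bits hold the logical counter, the rest is
# physical time in microseconds since the Unix epoch.
LOGICAL_BITS = 12
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_hybrid_timestamp(ts: int) -> tuple:
    """Split a 64-bit hybrid timestamp into (physical_usec, logical)."""
    return ts >> LOGICAL_BITS, ts & ((1 << LOGICAL_BITS) - 1)


def physical_to_utc(physical_usec: int) -> datetime:
    """Convert physical microseconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(microseconds=physical_usec)


# The three op timestamps from the WAL dump above.
for ts in (6400743143334211584, 6400743925559676928, 11589864471731939930):
    physical_usec, logical = decode_hybrid_timestamp(ts)
    print(ts, physical_to_utc(physical_usec).isoformat(), "logical:", logical)
```

Under that assumption, ops 130.138 and 131.139 decode to 2019 and sit about 191 seconds apart, while op 132.140 decodes to 2059, far beyond the 10-second bound.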
In this specific case, we verified that the replayed WAL wasn't that far behind
the recovery WAL, which had the future timestamps, so we could just delete the
recovery WAL and bootstrap from the replayed WAL.
It would have been nice had those bad ops not been written at all, maybe by
preventing an election between such mismatched servers in the first place.
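One way to picture such a guard, purely as a hypothetical sketch: a voter compares the candidate's clock reading against its own and withholds its vote if the two differ by more than the max sync error. Nothing here reflects an existing Kudu API. The name should_grant_vote, and the idea that a vote request carries the candidate's current time, are assumptions for illustration.

```python
# 10 sec, matching the --max_clock_sync_error_usec value quoted in this report.
MAX_CLOCK_SYNC_ERROR_USEC = 10_000_000


def should_grant_vote(voter_now_usec: int, candidate_now_usec: int,
                      max_error_usec: int = MAX_CLOCK_SYNC_ERROR_USEC) -> bool:
    """Hypothetical vote filter: deny the vote when the candidate's clock
    differs from the voter's by more than the allowed error."""
    return abs(candidate_now_usec - voter_now_usec) <= max_error_usec


# A candidate a few seconds off is acceptable; one decades ahead is not.
print(should_grant_vote(1_562_681_622_451_093, 1_562_681_625_451_093))
print(should_grant_vote(1_562_681_622_451_093, 2_829_556_755_793_930))
```

A real fix would also have to account for the time a vote request spends in flight, which this sketch ignores.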