David Ribeiro Alves has posted comments on this change. Change subject: KUDU-798 (part 5) Correct safe time advancement ......................................................................
Patch Set 32: (35 comments) http://gerrit.cloudera.org:8080/#/c/5240/32//COMMIT_MSG Commit Message: PS32, Line 31: whole > hole doh PS32, Line 56: it > is Done PS32, Line 52: : - TimeManager now does a pre-flight check before waiting on safe time. : In particular it checks that: i) it has heard from the leader within : a configurable amount of time (that safe time _can_ make progress). : ii) it checks that the safe time it not more that a configurable : amount of time in the past, 30 seconds by default (that safe time : is likely to make progress to the required timestamp). > I haven't looked at the code yet, but one question here: these checks shoul doing this already, yeah :) PS32, Line 65: unique > any particular reason why not the one with dup keys too? because the duplicate keys one only has 20 different rows and the test works with row counts. PS32, Line 71: workloaf > typo Done http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/client/client-test-util.cc File src/kudu/client/client-test-util.cc: Line 58: ASSERT_OK(scanner.SetSelection(KuduClient::LEADER_ONLY)); > is LEADER_ONLY still relevant here? or should this be changed to set it to left a TODO think we should cleanup this retry code. http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/client/client-test.cc File src/kudu/client/client-test.cc: Line 1147: TEST_F(ClientTest, TestScanFaultTolerance) { > did you loop this test too? I had looped it quite a bit cuz it was flaky, but forgot to keep the results. Just looped it 1000 more. 1000 passes :) Updated the commit message. http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 456: KLOG_EVERY_N_SECS(WARNING, 10) << "Safe time advancement without writes is disabled. " > hrm, maybe FIRST_K is better here? not sure seeing it once every 10 seconds made it minutes Done http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: Line 1238: if (request->has_safe_timestamp()) { > I'm still not sure this is in the right spot if the tail of the messages in yeah I agree that we shouldn't take the safe timestamp into account if we failed to prepare. left a todo. http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/time_manager.cc File src/kudu/consensus/time_manager.cc: PS32, Line 31: safe_time_max_last_advanced_safe_time > not a big fan of this name. How about: missed_heartbeats_before_rejecting_s Done PS32, Line 33: snapshot scans > scans at snapshots that aren't yet safe Done Line 34: TAG_FLAG(safe_time_max_last_advanced_safe_time, advanced); > let's mark this unstable for now too Done PS32, Line 37: lag > lag behind the requested timestamp? Done Line 39: TAG_FLAG(safe_time_max_lag_ms, advanced); > same Done PS32, Line 151: $0 secs, (max is $1 sec > 'secs' is redundant here since MonoDelta::ToString already has a 's' suffix Done Line 166: safe_time_diff.ToMilliseconds(), > wrong units here (message says secs) Done Line 167: FLAGS_safe_time_max_lag_ms); > same Done Line 173: void TimeManager::MakeWaiterTimeoutMessageUnlocked(Timestamp timestamp, string* error_message) { > probably easier to just return string from this method instead of the out-p yeah, it's ugly so I tried to hide it. moved back. Line 188: if (mode_ == NON_LEADER && timestamp > last_safe_ts_) { > ah ok, so you are doing the thing that I asked about in the commit message. Done http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/time_manager.h File src/kudu/consensus/time_manager.h: PS32, Line 117: is > if Done http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: Line 2345: "--maintenance_manager_polling_interval_ms=300", > typo? Done Line 2664: //workload.set_num_read_threads(2); > hrm? Done http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/integration-tests/test_workload.cc File src/kudu/integration-tests/test_workload.cc: Line 270: start_latch_.Reset((uint64_t)num_write_threads_ + (uint64_t)num_read_threads_); > nit: why these casts? cuz tidy bot complains otherwise. removed http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/tools/tool_action_remote_replica.cc File src/kudu/tools/tool_action_remote_replica.cc: Line 100: // and thus cannot do snapshot scans. > I think ORDERED is still useful here. the two were mashed together when we added (or after) fault tolerance. now the the tablet service will return an error if ordered is set but not snapshot. I agree that order is important but not being able to dump a server data if its alone is worse imo. Left a TODO for when we have another read mode. http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: PS32, Line 84: time > 'of time (in milliseconds)' and remove 'in milliseconds' from the end of th Done PS32, Line 85: it's > its Done PS32, Line 85: allows to > 'allows us to' or 'allows waiting' Done Line 1329: ": has no lower or upper bound."); > lots of reindentation in this file. mind reverting that part of the patch? must have pressed the indent file key by mistake. Done PS32, Line 1387: Retunrs > typo Done PS32, Line 1388: CheckIfAncientHistory > rename to 'VerifyNotAncientHistory' or 'VerifyAfterAHM' or something. Other Done Line 1759: MonoTime ClampDeadline(const MonoTime& deadline, bool* was_clamped) { > rename to ClampScanDeadlineForWait or something so it's clearer where this Done PS32, Line 1816: ahm > nit: AHM Done PS32, Line 1842: allows to > allows us to Done PS32, Line 1850: s.ok() > !s.ok() is redundant here with s.IsTimedout() Done Line 1861: if (!s.ok() && s.IsTimedOut() && was_clamped) { > same Done -- To view, visit http://gerrit.cloudera.org:8080/5240 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e Gerrit-PatchSet: 32 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves <dral...@apache.org> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes