Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16648 )
Change subject: KUDU-2612: restrict TxnStatusManager calls to be made by the leader only ...................................................................... Patch Set 6: (27 comments) A bunch of nits but overall looking much better! http://gerrit.cloudera.org:8080/#/c/16648/5/src/2.patch File src/2.patch: PS5: Remove this. http://gerrit.cloudera.org:8080/#/c/16648/5/src/3.patch File src/3.patch: PS5: Remove this. http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/integration-tests/txn_status_manager-itest.cc File src/kudu/integration-tests/txn_status_manager-itest.cc: http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/integration-tests/txn_status_manager-itest.cc@447 PS5, Line 447: SleepFor(MonoDelta::FromMilliseconds(5 * keepalive_interval_ms)); Do we need this if we're repeating in ASSERT_EVENTUALLY? http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc File src/kudu/integration-tests/txn_status_table-itest.cc: http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc@744 PS6, Line 744: ; nit: drop extra semicolon http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc@751 PS6, Line 751: threads.emplace_back([&] { What does the multithreading add here? Seems like it complicates things since we'd have to worry about monotonically increasing transaction IDs. http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc@758 PS6, Line 758: if (s.ok()) { What failures are we expecting? http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc@748 PS6, Line 748: std::atomic<int64_t> next_thread_id = 0; : vector<int> txn_id_index = { 1, 6, 11, 16, 21}; : for (int t = 0; t < 5; t++) { : threads.emplace_back([&] { : int thread_id = next_thread_id++; : CHECK_LT(thread_id, txn_id_index.size()); : int start_txn_id = txn_id_index[thread_id]; : for (int i = start_txn_id; i < start_txn_id + 5; i++) { : TxnStatusEntryPB txn_status; : Status s = txn_sys_client_->BeginTransaction(i, kUser); : if (s.ok()) { : // Make sure a re-election happens before the following read. : SleepFor(MonoDelta::FromMilliseconds(3 * FLAGS_raft_heartbeat_interval_ms)); : s = txn_sys_client_->GetTransactionStatus(i, kUser, &txn_status); : CHECK_OK(s); : CHECK(txn_status.has_user()); : CHECK_STREQ(kUser, txn_status.user().c_str()); : CHECK(txn_status.has_state()); : CHECK_EQ(TxnStatePB::OPEN, txn_status.state()); : } : } : }); : } nit: would be easier to read as: const int kTxnsPerThread = 5; const int kNumThreads = 5; vector<threads> threads; for (int t = 0; t < kNumThreads; t++) { threads.emplace_back([t, &] { for (int i = 0; i < kNumTxnsPerThreads; i++) { // do stuff on txn (t * kNumTxnsPerThreads + i) } } } http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/tablet_replica.cc@395 PS5, Line 395: "Received notification of TxnStatusTablet state change " : << "but the raft consensus is not initialized. Tablet ID: " : << tablet_id << ". Reason: " << reason; nit: use Substitute. Same below. http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/tablet_replica.cc@407 PS5, Line 407: << nit: spacing alignment http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/tablet_replica.cc@997 PS5, Line 997: state_ == STOPPING || state_ == STOPPED I'm curious whether the code would work if we reused the != RUNNING condition. If so, we could use the same function at both callsites. http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc@403 PS6, Line 403: << nit: maybe add a message here too indicating what's going on http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc@472 PS6, Line 472: LOG(WARNING) << Substitute("The tablet is already shutting down or shutdown. State: $0", : TabletStatePB_Name(state_)); nit: might be cleaner to let the callers of these otherwise simple functions handle logging. http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc@979 PS6, Line 979: if (!txn_coordinator_) { : return false; : } nit: maybe only call this if txn_coordinator_ is set and change this to a DCHECK? Same below. http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc@997 PS6, Line 997: state_ == STOPPING || state_ == STOPPED) { nit: I'm curious, why not reuse the != RUNNING condition from above? A comment would be nice. http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/txn_coordinator.h File src/kudu/tablet/txn_coordinator.h: http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/txn_coordinator.h@50 PS5, Line 50: It's about reload tablet metadata nit: how about "This task loads the contents of the tablet into memory and does other work..." http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.h File src/kudu/transactions/txn_status_manager.h: http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.h@109 PS6, Line 109: but some operations : // may still be legal. I might be missing something but is this fact used anywhere? What operations are still legal? http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.h@106 PS6, Line 106: const Status& replica_status() const { return replica_status_; } : : // Leadership status of the transaction status manager. If not OK, the : // transaction status manager is not the leader, but some operations : // may still be legal. : const Status& leader_status() const { : return leader_status_; : } nit: Are these ever used? http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.h@118 PS6, Line 118: if (!replica_status_.ok()) { : return replica_status_; : } : return leader_status_; nit: a bit simpler as RETURN_NOT_OK(replica_status_); return leader_status_; http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/transactions/txn_status_manager.h File src/kudu/transactions/txn_status_manager.h: http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/transactions/txn_status_manager.h@230 PS5, Line 230: // data is loaded and the replic nit: maybe call it LoadFromTabletForTests() then http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc File src/kudu/transactions/txn_status_manager.cc: http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@83 PS6, Line 83: If this time is exceeded, the " : "node crashes." If the consequence is so severe, maybe we should set this to be something like 5 minutes? Or, rather than crashing, why not just return an error to the client? Or would it be better to just wait? http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@217 PS6, Line 217: leader_ready_term != initial_term_ || : !leader_shared_lock_.owns_lock()) nit: Curious, doesn't this mean that we've changed leaders? The error message doesn't seem to reflect that. http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@235 PS6, Line 235: NOT_THE_LEADER Should this only be set if leader_status_ is set? http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@320 PS6, Line 320: RETURN_NOT_OK nit: just return this? http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@369 PS6, Line 369: } nit: maybe log something about beginning to wait? http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@414 PS6, Line 414: failed nit: "interrupted" would be less alarming, given this is somewhat normal http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@430 PS6, Line 430: Substitute("Tablet") nit: remove http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tserver/ts_tablet_manager.cc@1428 PS6, Line 1428: RunTx nit: maybe name these ShouldRun... Run...Task makes it seem like the functions actually run the task, rather than determine whether we should run. -- To view, visit http://gerrit.cloudera.org:8080/16648 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I42c1ad095dcb4bdffcbe0ecf9631a60bac208c2a Gerrit-Change-Number: 16648 Gerrit-PatchSet: 6 Gerrit-Owner: Hao Hao <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hao Hao <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Tue, 12 Jan 2021 23:52:42 +0000 Gerrit-HasComments: Yes
