Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16648 )

Change subject: KUDU-2612: restrict TxnStatusManager calls to be made by the 
leader only
......................................................................


Patch Set 6:

(27 comments)

A bunch of nits but overall looking much better!

http://gerrit.cloudera.org:8080/#/c/16648/5/src/2.patch
File src/2.patch:

PS5:
Remove this.


http://gerrit.cloudera.org:8080/#/c/16648/5/src/3.patch
File src/3.patch:

PS5:
Remove this.


http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/integration-tests/txn_status_manager-itest.cc
File src/kudu/integration-tests/txn_status_manager-itest.cc:

http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/integration-tests/txn_status_manager-itest.cc@447
PS5, Line 447:   SleepFor(MonoDelta::FromMilliseconds(5 * 
keepalive_interval_ms));
Do we need this if we're repeating in ASSERT_EVENTUALLY?


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc
File src/kudu/integration-tests/txn_status_table-itest.cc:

http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc@744
PS6, Line 744: ;
nit: drop extra semicolon


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc@751
PS6, Line 751: threads.emplace_back([&] {
What does the multithreading add here? Seems like it complicates things since 
we'd have to worry about monotonically increasing transaction IDs.


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc@758
PS6, Line 758:         if (s.ok()) {
What failures are we expecting?


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/integration-tests/txn_status_table-itest.cc@748
PS6, Line 748:   std::atomic<int64_t> next_thread_id = 0;
             :   vector<int> txn_id_index = { 1, 6, 11, 16, 21};
             :   for (int t = 0; t < 5; t++) {
             :     threads.emplace_back([&] {
             :       int thread_id = next_thread_id++;
             :       CHECK_LT(thread_id, txn_id_index.size());
             :       int start_txn_id = txn_id_index[thread_id];
             :       for (int i = start_txn_id; i < start_txn_id + 5; i++) {
             :         TxnStatusEntryPB txn_status;
             :         Status s = txn_sys_client_->BeginTransaction(i, kUser);
             :         if (s.ok()) {
             :           // Make sure a re-election happens before the 
following read.
             :           SleepFor(MonoDelta::FromMilliseconds(3 * 
FLAGS_raft_heartbeat_interval_ms));
             :           s = txn_sys_client_->GetTransactionStatus(i, kUser, 
&txn_status);
             :           CHECK_OK(s);
             :           CHECK(txn_status.has_user());
             :           CHECK_STREQ(kUser, txn_status.user().c_str());
             :           CHECK(txn_status.has_state());
             :           CHECK_EQ(TxnStatePB::OPEN, txn_status.state());
             :         }
             :       }
             :     });
             :   }
nit: would be easier to read as:

const int kTxnsPerThread = 5;
const int kNumThreads = 5;
vector<threads> threads;
for (int t = 0; t < kNumThreads; t++) {
  threads.emplace_back([t, &] {
    for (int i = 0; i < kNumTxnsPerThreads; i++) {
      // do stuff on txn (t * kNumTxnsPerThreads + i)
    }
  }
}


http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/tablet_replica.cc
File src/kudu/tablet/tablet_replica.cc:

http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/tablet_replica.cc@395
PS5, Line 395: "Received notification of TxnStatusTablet state change "
             :                              << "but the raft consensus is not 
initialized. Tablet ID: "
             :                              << tablet_id << ". Reason: " << 
reason;
nit: use Substitute. Same below.


http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/tablet_replica.cc@407
PS5, Line 407:   <<
nit: spacing alignment


http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/tablet_replica.cc@997
PS5, Line 997: state_ == STOPPING || state_ == STOPPED
I'm curious whether the code would work if we reused the != RUNNING condition. 
If so, we could use the same function at both callsites.


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc
File src/kudu/tablet/tablet_replica.cc:

http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc@403
PS6, Line 403: <<
nit: maybe add a message here too indicating what's going on


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc@472
PS6, Line 472: LOG(WARNING) << Substitute("The tablet is already shutting down 
or shutdown. State: $0",
             :                                TabletStatePB_Name(state_));
nit: might be cleaner to let the callers of these otherwise simple functions 
handle logging.


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc@979
PS6, Line 979:   if (!txn_coordinator_) {
             :     return false;
             :   }
nit: maybe only call this if txn_coordinator_ is set and change this to a 
DCHECK? Same below.


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tablet/tablet_replica.cc@997
PS6, Line 997: state_ == STOPPING || state_ == STOPPED) {
nit: I'm curious, why not reuse the != RUNNING condition from above? A comment 
would be nice.


http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/txn_coordinator.h
File src/kudu/tablet/txn_coordinator.h:

http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/tablet/txn_coordinator.h@50
PS5, Line 50: It's about reload tablet metadata
nit: how about "This task loads the contents of the tablet into memory and does 
other work..."


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.h
File src/kudu/transactions/txn_status_manager.h:

http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.h@109
PS6, Line 109: but some operations
             :     // may still be legal.
I might be missing something but is this fact used anywhere? What operations 
are still legal?


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.h@106
PS6, Line 106:     const Status& replica_status() const { return 
replica_status_; }
             :
             :     // Leadership status of the transaction status manager. If 
not OK, the
             :     // transaction status manager is not the leader, but some 
operations
             :     // may still be legal.
             :     const Status& leader_status() const {
             :       return leader_status_;
             :     }
nit: Are these ever used?


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.h@118
PS6, Line 118: if (!replica_status_.ok()) {
             :         return replica_status_;
             :       }
             :       return leader_status_;
nit: a bit simpler as

 RETURN_NOT_OK(replica_status_);
 return leader_status_;


http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/transactions/txn_status_manager.h
File src/kudu/transactions/txn_status_manager.h:

http://gerrit.cloudera.org:8080/#/c/16648/5/src/kudu/transactions/txn_status_manager.h@230
PS5, Line 230:   // data is loaded and the replic
nit: maybe call it LoadFromTabletForTests() then


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc
File src/kudu/transactions/txn_status_manager.cc:

http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@83
PS6, Line 83: If this time is exceeded, the "
            :              "node crashes."
If the consequence is so severe, maybe we should set this to be something like 
5 minutes? Or, rather than crashing, why not just return an error to the 
client? Or would it be better to just wait?


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@217
PS6, Line 217: leader_ready_term != initial_term_ ||
             :                     !leader_shared_lock_.owns_lock())
nit: Curious, doesn't this mean that we've changed leaders? The error message 
doesn't seem to reflect that.


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@235
PS6, Line 235: NOT_THE_LEADER
Should this only be set if leader_status_ is set?


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@320
PS6, Line 320: RETURN_NOT_OK
nit: just return this?


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@369
PS6, Line 369:   }
nit: maybe log something about beginning to wait?


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@414
PS6, Line 414: failed
nit: "interrupted" would be less alarming, given this is somewhat normal


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/transactions/txn_status_manager.cc@430
PS6, Line 430: Substitute("Tablet")
nit: remove


http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/16648/6/src/kudu/tserver/ts_tablet_manager.cc@1428
PS6, Line 1428: RunTx
nit: maybe name these ShouldRun...

Run...Task makes it seem like the functions actually run the task, rather than 
determine whether we should run.



--
To view, visit http://gerrit.cloudera.org:8080/16648
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I42c1ad095dcb4bdffcbe0ecf9631a60bac208c2a
Gerrit-Change-Number: 16648
Gerrit-PatchSet: 6
Gerrit-Owner: Hao Hao <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 12 Jan 2021 23:52:42 +0000
Gerrit-HasComments: Yes

Reply via email to