Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16648 )
Change subject: (wip) KUDU-2612: restrict TxnStatusManager calls to be made by the leader only ...................................................................... Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/16648/2/src/kudu/transactions/txn_status_manager.cc File src/kudu/transactions/txn_status_manager.cc: http://gerrit.cloudera.org:8080/#/c/16648/2/src/kudu/transactions/txn_status_manager.cc@a482 PS2, Line 482: : : : : : : : : : : : : : : Can we at least check that we have acquired the leader lock? http://gerrit.cloudera.org:8080/#/c/16648/2/src/kudu/transactions/txn_status_manager.cc@724 PS2, Line 724: auto* consensus = status_tablet_.tablet_replica_->consensus(); : DCHECK(consensus); nit: can combine as: auto* consensus = DCHECK_NOTNULL(...consensus()); http://gerrit.cloudera.org:8080/#/c/16648/2/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/16648/2/src/kudu/tserver/ts_tablet_manager.cc@1420 PS2, Line 1420: { : shared_lock<RWMutex> l(lock_); : for (const auto& elem : tablet_map_) { : auto r = elem.second; : // Find txn status tablet replicas. : if (r->txn_coordinator()) { : replicas.emplace_back(std::move(r)); : } : } : } : for (auto& r : replicas) { : if (shutdown_latch_.count() == 0) { : return; : } : auto* coordinator = DCHECK_NOTNULL(r->txn_coordinator()); It may be worth considering updating this to look more like how we schedule maintenance ops. E.g. rather than iterating through each replica and trying to check for coordinator, instead when we create a replica that is a coordinator, "register" that coordinator with the TSTabletManager when it's fully started and ready. Conceptually, the problems at hand seems similar, that we want to run tasks using a shared threadpool, and that the tasks are tied to the our replicas, and the replica shutting down shouldn't collide with the task running. Take a look at util/maintenance_manager.cc to see if a similar approach might make sense here, e.g. when shutting down a replica, if an abort task is in progress, we wait for it to complete, and then proceed with the shutdown. That might also help avoid races with this task and the initial call to TabletReplica::Start(). -- To view, visit http://gerrit.cloudera.org:8080/16648 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I42c1ad095dcb4bdffcbe0ecf9631a60bac208c2a Gerrit-Change-Number: 16648 Gerrit-PatchSet: 2 Gerrit-Owner: Hao Hao <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hao Hao <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Thu, 24 Dec 2020 00:08:26 +0000 Gerrit-HasComments: Yes
