Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16343 )
Change subject: KUDU-1587 part 2: reject write ops if apply queue is overloaded ...................................................................... Patch Set 6: (6 comments) http://gerrit.cloudera.org:8080/#/c/16343/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16343/6//COMMIT_MSG@23 PS6, Line 23: This new behavior is not yet enabled by default, keeping the legacy : behavior of unbounded/uncontrolled queue times as is. To enable it, : set --tablet_apply_pool_overload_threshold_ms to something greater : than 0 (e.g., 500). If not in an automated test, it'd be great if you could manually run a similar workload as that described by the Jira, if anything, to validate that throttled requests are retried, examine effects on overall throughput and memory usage, etc. http://gerrit.cloudera.org:8080/#/c/16343/6/src/kudu/tablet/tablet_replica.h File src/kudu/tablet/tablet_replica.h: http://gerrit.cloudera.org:8080/#/c/16343/6/src/kudu/tablet/tablet_replica.h@411 PS6, Line 411: Random rng_; Shouldn't we be using ThreadSafeRandom, given we call this from the reactors? http://gerrit.cloudera.org:8080/#/c/16343/6/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: http://gerrit.cloudera.org:8080/#/c/16343/6/src/kudu/tablet/tablet_replica.cc@445 PS6, Line 445: // If the apply queue is overloaded, reject the incoming operation with : // some probability. : MonoDelta queue_otime; : MonoDelta threshold; : if (apply_pool_->QueueOverloaded(&queue_otime, &threshold)) { : auto overload_threshold_ms = threshold.ToMilliseconds(); : // The longer the queue has been in the overloaded state, the higher the : // probability of rejection. : auto time_factor = queue_otime.ToMilliseconds() / overload_threshold_ms + 1; : // The heavier the request in terms of number of rows, the higher the : // probability of rejecting it if the apply queue is overloaded. : auto size_factor = op_state->request()->row_operations().rows().size(); : auto factor = 1 + size_factor * time_factor; : if (!rng_.OneIn(factor)) { : return Status::ServiceUnavailable("op apply queue is overloaded"); : } : } Could we instead bake this into the TabletService endpoint? We already do some throttling there -- it'd be nice to coalesce if possible, unless there's a reason for spreading it out across subsystems. Plus, that'd allow us to share a single RNG. http://gerrit.cloudera.org:8080/#/c/16343/6/src/kudu/tserver/tablet_server-test.cc File src/kudu/tserver/tablet_server-test.cc: http://gerrit.cloudera.org:8080/#/c/16343/6/src/kudu/tserver/tablet_server-test.cc@4365 PS6, Line 4365: class OpApplyQueueTest : public TabletServerTestBase { Could you also add a small test that even with a heavily throttled queue, clients will retry and eventually succeed? http://gerrit.cloudera.org:8080/#/c/16343/6/src/kudu/tserver/tablet_server-test.cc@4375 PS6, Line 4375: intjected nit: injected http://gerrit.cloudera.org:8080/#/c/16343/6/src/kudu/tserver/tablet_server-test.cc@4468 PS6, Line 4468: that nit: rather than -- To view, visit http://gerrit.cloudera.org:8080/16343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6d7688d6fa832e606b8efc4549568fa52dfa1931 Gerrit-Change-Number: 16343 Gerrit-PatchSet: 6 Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Attila Bukor <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Mon, 24 Aug 2020 20:27:20 +0000 Gerrit-HasComments: Yes
