Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
......................................................................

Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1067: // Skip calling SetupErrorAndRespond since this path doesn't need the
> Check out the 'Advanced per-instance throttling' section of util/logging.h

OK so after trying this out on a cluster, I think we should allow it to log every time. To balance this out, I think we should downgrade the 'tablet x needs tablet copy' message. The net result is that we're logging the begin tablet copy result instead of the fact that we'll be requesting it. For reference, here's a cross section of these logs for a particular tablet:

I0522 17:09:38.703362 15818 consensus_queue.cc:395] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: Peer cc32936bc8594948a04fd4240da36aed needs tablet copy
W0522 17:09:38.703636 4776 consensus_peers.cc:352] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: THROTTLED status { code: SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 0/0 tasks queued)" } }
I0522 17:09:40.211633 15820 consensus_queue.cc:395] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: Peer cc32936bc8594948a04fd4240da36aed needs tablet copy
W0522 17:09:40.211971 4776 consensus_peers.cc:352] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: THROTTLED status { code: SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 0/0 tasks queued)" } }
I0522 17:09:41.703528 11794 consensus_queue.cc:395] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: Peer cc32936bc8594948a04fd4240da36aed needs tablet copy
W0522 17:09:41.703760 4776 consensus_peers.cc:352] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: THROTTLED status { code: SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 0/0 tasks queued)" } }

On clusters approaching normalcy, I wouldn't expect to see these logs much at all.

--
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
