Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11801 )
Change subject: Improve consensus queue overflow logging
......................................................................

Improve consensus queue overflow logging

Suppose tablet server X is a leader of T tablets for which tablet server
Y is a follower. The relevant situation is when T is on the order of
100-1000. If Y strains under its consensus load and falls behind
processing consensus service requests, UpdateConsensus requests from the
leader will get rejected and cause a message to be logged on the leader
X for each of the T tablets. The message looks like:

W1022 17:20:59.767554 13057 consensus_peers.cc:422] T 9255fdf03ad4451e9fcd62f26741bfe6 P 892cc0d4442c4cdaaf633ed2732f9246 -> Peer dc0af5867d52468f8fd47abf13c08040 (tablet_server_Y.kudu.com:7050): Couldn't send request to peer dc0af5867d52468f8fd47abf13c08040 for tablet 9255fdf03ad4451e9fcd62f26741bfe6. Status: Remote error: Service unavailable: UpdateConsensus request on kudu.consensus.ConsensusService from 10.1.1.1:55528 dropped due to backpressure. The service queue is full; it has 50 items.. Retrying in the next heartbeat period. Already tried 1 times.

Y's consensus service pool also logs the same thing, but it doesn't have
the information about the tablet id or peer ids available to it, and it
is throttled to occur no more than once per second:

W1022 17:37:33.535168 4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items.

This patch attempts to reduce the spam of the first message in the logs
by throttling it to occur once every 5 retries. It still is logged for
every tablet peer, but those messages are useful if one wants to trace
the history of a particular tablet.
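
To make the "once every 5 retries" rule concrete, here is a minimal C++
sketch. It is not the code from this patch. It assumes that the first
failed attempt is logged and then every 5th attempt after it; the names
PeerRetryState, kRetriesBetweenLogs, and CountLoggedFailures are
hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value, taken from the "once every 5 retries" wording above.
constexpr int64_t kRetriesBetweenLogs = 5;

// Hypothetical stand-in for per-peer retry state; not Kudu's Peer class.
struct PeerRetryState {
  int64_t failed_attempts = 0;

  // Records one failed request and returns true if it should be logged.
  // Assumption: the 1st failure is logged, then the 6th, 11th, and so on.
  bool RecordFailure() {
    ++failed_attempts;
    return (failed_attempts - 1) % kRetriesBetweenLogs == 0;
  }

  // A successful request resets the counter, so the next failure is
  // logged again immediately.
  void RecordSuccess() { failed_attempts = 0; }
};

// Counts how many of `failures` consecutive failures would be logged
// for a single tablet peer.
inline int64_t CountLoggedFailures(int64_t failures) {
  assert(failures >= 0);
  PeerRetryState state;
  int64_t logged = 0;
  for (int64_t i = 0; i < failures; ++i) {
    if (state.RecordFailure()) ++logged;
  }
  return logged;
}
```

Under these assumptions, 11 consecutive failures for one tablet peer
produce 3 log lines (attempts 1, 6, and 11) instead of 11.
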
I also added the throttling messages to Y's output, so it's now

W1022 17:37:33.535168 4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items. [suppressed 5 similar messages]

when e.g. 5 other messages have been suppressed.

Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Reviewed-on: http://gerrit.cloudera.org:8080/11801
Tested-by: Will Berkeley <wdberke...@gmail.com>
Reviewed-by: Alexey Serbin <aser...@cloudera.com>
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/rpc/service_pool.cc
M src/kudu/util/logging.h
4 files changed, 22 insertions(+), 10 deletions(-)

Approvals:
  Will Berkeley: Verified
  Alexey Serbin: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/11801

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wdberke...@gmail.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>