Will Berkeley has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/11801 )

Change subject: Improve consensus queue overflow logging
......................................................................

Improve consensus queue overflow logging

Suppose tablet server X is the leader of T tablets for which tablet server Y is
a follower, where T is on the order of 100-1000. If Y strains under its
consensus load and falls behind in processing consensus service requests,
UpdateConsensus requests from the leader get rejected, and the leader X logs a
message for each of the T tablets. The message looks like:

W1022 17:20:59.767554 13057 consensus_peers.cc:422] T 
9255fdf03ad4451e9fcd62f26741bfe6 P 892cc0d4442c4cdaaf633ed2732f9246 -> Peer 
dc0af5867d52468f8fd47abf13c08040 (tablet_server_Y.kudu.com:7050): Couldn't send 
request to peer dc0af5867d52468f8fd47abf13c08040 for tablet 
9255fdf03ad4451e9fcd62f26741bfe6. Status: Remote error: Service unavailable: 
UpdateConsensus request on kudu.consensus.ConsensusService from 10.1.1.1:55528 
dropped due to backpressure. The service queue is full; it has 50 items.. 
Retrying in the next heartbeat period. Already tried 1 times.

Y's consensus service pool logs the same rejection, but it does not have the
tablet id or peer ids available to it, and its message is throttled to occur no
more than once per second:

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on 
kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to 
backpressure. The service queue is full; it has 50 items.

This patch attempts to reduce the spam from the first message in the logs
by throttling it to occur once every 5 retries. It is still logged for
every tablet peer, but those messages are useful if one wants to trace
the history of a particular tablet.
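
As a rough illustration of retry-count throttling, the sketch below decides
whether a given failed attempt should be logged. It is not the code from this
patch: the names ShouldLogFailedAttempt and kLogEveryNRetries are hypothetical,
and the choice to log the 1st, 6th, 11th, ... attempt is an assumption about
what "once every 5 retries" means.

```cpp
#include <cstdint>

// Hypothetical constant: log one failure out of every 5 attempts.
constexpr int64_t kLogEveryNRetries = 5;

// Hypothetical helper. 'failed_attempts' is the 1-based count of consecutive
// failed requests to one peer for one tablet. Returns true for attempts
// 1, 6, 11, ... so the first failure is always reported.
bool ShouldLogFailedAttempt(int64_t failed_attempts) {
  return failed_attempts % kLogEveryNRetries == 1;
}
```

If each tablet peer keeps its own attempt counter, every peer still reports
its first failure, which preserves the per-tablet history described above.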

I also added the count of suppressed messages to Y's output, so it now reads

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on 
kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to 
backpressure. The service queue is full; it has 50 items. [suppressed 5 similar 
messages]

when e.g. 5 other messages have been suppressed.
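
A minimal sketch of time-based throttling that reports how many messages were
suppressed is shown below. It is a standalone illustration, not the logging
code in src/kudu/util/logging.h: the class name LogThrottler, its Filter
method, and the explicit 'now' parameter (used so the behavior is
deterministic) are all assumptions made for this example.

```cpp
#include <chrono>
#include <string>

// Hypothetical throttler: lets through at most one message per interval and
// appends the number of messages dropped since the last one that got through.
class LogThrottler {
 public:
  using Clock = std::chrono::steady_clock;

  explicit LogThrottler(std::chrono::milliseconds min_interval)
      : min_interval_(min_interval) {}

  // Returns true and fills '*out' if the message should be logged at 'now'.
  // Returns false (leaving '*out' untouched) if the message is suppressed.
  bool Filter(const std::string& msg, Clock::time_point now, std::string* out) {
    if (has_logged_ && now - last_logged_ < min_interval_) {
      ++suppressed_;
      return false;
    }
    *out = msg;
    if (suppressed_ > 0) {
      *out += " [suppressed " + std::to_string(suppressed_) +
              " similar messages]";
    }
    has_logged_ = true;
    last_logged_ = now;
    suppressed_ = 0;
    return true;
  }

 private:
  const std::chrono::milliseconds min_interval_;
  Clock::time_point last_logged_{};
  bool has_logged_ = false;
  int suppressed_ = 0;
};
```

With a one-second interval, a burst of six rejections inside the same second
produces one log line, and the next line that gets through carries the
"[suppressed 5 similar messages]" suffix.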

Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Reviewed-on: http://gerrit.cloudera.org:8080/11801
Tested-by: Will Berkeley <wdberke...@gmail.com>
Reviewed-by: Alexey Serbin <aser...@cloudera.com>
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/rpc/service_pool.cc
M src/kudu/util/logging.h
4 files changed, 22 insertions(+), 10 deletions(-)

Approvals:
  Will Berkeley: Verified
  Alexey Serbin: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/11801

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wdberke...@gmail.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>
