Hello Mike Percy, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12185

to look at the new patch set (#2).

Change subject: Reduce severity of "Error trying to read ahead of the log" log 
message
......................................................................

Reduce severity of "Error trying to read ahead of the log" log message

In clusters under load, it is typical to see the ERROR log dominated by
messages like the following:

E1221 09:09:13.869918 143384 consensus_queue.cc:440] T 
1d030514317942ec9d7796a8a029dace P a4eea0affa4545879c8988b24d8cb2bb [LEADER]: 
Error trying to read ahead of the log while preparing peer request: Incomplete: 
Op with index 6620 is ahead of the local log (next sequential op: 6620). 
Destination peer: Peer: c0a2e34b708845efb8f090459815272c, Is new: false, Last 
received: 2995.6619, Next index: 6621, Last known committed idx: 6620, Last 
exchange result: ERROR, Needs tablet copy: false

This message is logged at the error level, and looks a little bit scary,
but it is in fact harmless. Here's what happens:

1. A leader neglects its duties and the followers elect a new leader.
2. The new leader manages to replicate more ops (usually just the NO_OP
   asserting leadership).
3. The leader of the previous term tries to replicate an op to a peer in
   the new term.
4. From the response, it founds out that
   a. It is in an earlier term, so it should step down and increment its
      term.
   b. The last op the peer saw is (leader's index + k) for some k > 0
      (usually k = 1). So the leader will attempt to send up ops of
      index up through (leader's index + k).
5. The term change is asynchronous, and before it happens the leader
   tries to prepare a new request to the peer whose log is ahead of the
   local log. This causes the ERROR message.
6. The term change happens. The leader steps down, and life goes on.

Note that the leader should also have received a VoteRequest with the
new term and an UpdateConsensus with the new term from the leader. In
general, this message appears only when the leader is under enough
stress to lose its leadership and be unable to service some consensus
RPCs in a timely manner. It is not in an of itself a problem, and it's a
decent indicator of stress on the leader, so I am leaving the message
but downgrading it to INFO level.

See KUDU-1078 for more information about this situation, especially its
previous causes (which were at one time actually harmful).

Change-Id: Ib841a52354e5c71450f3e364ab2fdee51ce2adb4
---
M src/kudu/consensus/consensus_queue.cc
1 file changed, 6 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/85/12185/2
--
To view, visit http://gerrit.cloudera.org:8080/12185
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib841a52354e5c71450f3e364ab2fdee51ce2adb4
Gerrit-Change-Number: 12185
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wdberke...@gmail.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>

Reply via email to