[ 
https://issues.apache.org/jira/browse/KUDU-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15448056#comment-15448056
 ] 

Todd Lipcon commented on KUDU-1586:
-----------------------------------

I diagnosed this by bumping the vlog level to 2 on this host for a second or 
two (using ts-cli set_flag --force v 2).

{code}
I0829 21:55:05.215745 13731 log_cache.cc:307] T 
7919fcd47fd34c4989ce214d05e62d41 P 38d4433bb09948e58d10a74ba5f97c8b: 
Successfully read 1 ops from disk (611738..611738)
I0829 21:55:05.215782 13731 consensus_queue.cc:382] T 
7919fcd47fd34c4989ce214d05e62d41 P 38d4433bb09948e58d10a74ba5f97c8b [LEADER]: 
Sending status only request to Peer: b18f54151bc04da59520fdb086d5b571: 
tablet_id: "7919fcd47fd34c4989ce214d05e62d41"
caller_uuid: "38d4433bb09948e58d10a74ba5f97c8b"
caller_term: 175
preceding_id {
  term: 174
  index: 611737
}
committed_index {
  term: 175
  index: 611763
}
{code}

It appears that even though the remote peer was lagging behind, the leader was 
only sending status-only requests, probably because this single op was larger 
than the target batch size (1MB). I used ts-cli set_flag to raise 
consensus_max_batch_size_bytes to 4MB, and the loop stopped.
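
To illustrate the suspected failure mode, here is a minimal Python sketch. It is 
not Kudu's actual code: the function names, the op sizes, and the "lenient" 
variant are all hypothetical, and MAX_BATCH_SIZE_BYTES merely stands in for 
consensus_max_batch_size_bytes. It assumes the batching logic stops before any 
op that would push the batch over the limit, which would leave the batch empty 
when the very first pending op is already too large.

```python
# Hypothetical model of the batching behavior; not taken from Kudu's source.
MAX_BATCH_SIZE_BYTES = 1 * 1024 * 1024  # stand-in for consensus_max_batch_size_bytes


def read_batch_strict(op_sizes, start, max_bytes):
    """Collect op indexes from `start`, stopping before any op that would
    push the running total over `max_bytes`."""
    batch = []
    total = 0
    for idx in range(start, len(op_sizes)):
        if total + op_sizes[idx] > max_bytes:
            break
        batch.append(idx)
        total += op_sizes[idx]
    return batch


def read_batch_lenient(op_sizes, start, max_bytes):
    """Same as above, but always admit the first op so that an oversized op
    goes out alone instead of blocking the peer (one possible remedy)."""
    batch = []
    total = 0
    for idx in range(start, len(op_sizes)):
        # Only enforce the limit once the batch holds at least one op.
        if batch and total + op_sizes[idx] > max_bytes:
            break
        batch.append(idx)
        total += op_sizes[idx]
    return batch


# Made-up op sizes: the op at index 1 is 2MB, larger than the 1MB limit.
ops = [100, 2 * 1024 * 1024, 100]

# The peer needs index 1 next. With a 1MB limit the strict batch is empty,
# so only a status-only request could be sent, and the next attempt would
# hit the same op again.
stuck = read_batch_strict(ops, 1, MAX_BATCH_SIZE_BYTES)

# Raising the limit to 4MB (the workaround above) lets the op through.
unstuck = read_batch_strict(ops, 1, 4 * 1024 * 1024)

# The lenient variant makes progress even with the 1MB limit.
alone = read_batch_lenient(ops, 1, MAX_BATCH_SIZE_BYTES)
```

Under these assumptions, the strict reader never advances past the oversized op, 
which is consistent with the repeated "read 1 ops from disk" lines for the same 
index, while either a larger limit or admitting the first op unconditionally 
breaks the loop.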

> If a single op is larger than consensus_max_batch_size_bytes, consensus gets 
> stuck
> ----------------------------------------------------------------------------------
>
>                 Key: KUDU-1586
>                 URL: https://issues.apache.org/jira/browse/KUDU-1586
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.10.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>
> I noticed on a cluster test that a leader was spinning with log messages like:
> I0829 14:17:31.870786 22184 log_cache.cc:307] T 
> e7cacfdb22744496a6d5d66227a69823 P 5d15962d2f2445b1ba15b93ead4fb31b: 
> Successfully read 1 ops from disk (866604..866604)
> I0829 14:17:31.873234  6186 log_cache.cc:307] T 
> e7cacfdb22744496a6d5d66227a69823 P 5d15962d2f2445b1ba15b93ead4fb31b: 
> Successfully read 1 ops from disk (866604..866604)
> I0829 14:17:31.875713 22184 log_cache.cc:307] T 
> e7cacfdb22744496a6d5d66227a69823 P 5d15962d2f2445b1ba15b93ead4fb31b: 
> Successfully read 1 ops from disk (866604..866604)
> I0829 14:17:31.878078  6186 log_cache.cc:307] T 
> e7cacfdb22744496a6d5d66227a69823 P 5d15962d2f2445b1ba15b93ead4fb31b: 
> Successfully read 1 ops from disk (866604..866604)
> After investigation, it seems this op was larger than 1MB (the default 
> consensus batch size), which caused a tight loop with no progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
