[ https://issues.apache.org/jira/browse/KUDU-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15448056#comment-15448056 ]
Todd Lipcon commented on KUDU-1586:
-----------------------------------
I diagnosed this by bumping the vlog level to 2 on this host for a second or
two (using ts-cli set_flag --force v 2).
{code}
I0829 21:55:05.215745 13731 log_cache.cc:307] T 7919fcd47fd34c4989ce214d05e62d41 P 38d4433bb09948e58d10a74ba5f97c8b: Successfully read 1 ops from disk (611738..611738)
I0829 21:55:05.215782 13731 consensus_queue.cc:382] T 7919fcd47fd34c4989ce214d05e62d41 P 38d4433bb09948e58d10a74ba5f97c8b [LEADER]: Sending status only request to Peer: b18f54151bc04da59520fdb086d5b571:
tablet_id: "7919fcd47fd34c4989ce214d05e62d41"
caller_uuid: "38d4433bb09948e58d10a74ba5f97c8b"
caller_term: 175
preceding_id {
  term: 174
  index: 611737
}
committed_index {
  term: 175
  index: 611763
}
{code}
It appears that even though the remote peer was lagging behind (preceding_id
index 611737 vs. committed_index 611763), the leader was only sending
status-only requests. It had just read op 611738 from disk, but the request
carried no ops, probably because this single op was larger than the target
batch size (1MB). I used ts-cli set_flag to raise
consensus_max_batch_size_bytes to 4MB, and the loop stopped.
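
To illustrate the suspected failure mode, here is a minimal Python sketch. It
is not the actual Kudu code: select_batch, allow_oversized_first, and the op
sizes are hypothetical, chosen only to model a size-capped batch builder. It
shows how a strict cap yields an empty (status-only) batch when the first
pending op exceeds the limit, and how either raising the limit or always
admitting the first op would let the peer make progress.

```python
from typing import List, Sequence

MB = 1024 * 1024


def select_batch(op_sizes: Sequence[int],
                 max_batch_bytes: int,
                 allow_oversized_first: bool) -> List[int]:
    """Return the positions of the pending ops that go into one batch.

    Hypothetical model of a size-capped batch builder, not Kudu's
    implementation. Ops are taken in order until the cap would be exceeded.
    """
    batch: List[int] = []
    total = 0
    for i, size in enumerate(op_sizes):
        if total + size > max_batch_bytes:
            # Assumed fix: never return an empty batch while ops are
            # pending, even if the first op alone exceeds the cap.
            if allow_oversized_first and not batch:
                batch.append(i)
            break
        batch.append(i)
        total += size
    return batch


# One 2MB op followed by two small ops (illustrative sizes).
pending = [2 * MB, 100, 100]

stuck = select_batch(pending, 1 * MB, allow_oversized_first=False)
fixed = select_batch(pending, 1 * MB, allow_oversized_first=True)
raised = select_batch(pending, 4 * MB, allow_oversized_first=False)

print(stuck)   # empty batch: only a status-only request can be sent
print(fixed)   # the oversized op is sent on its own
print(raised)  # with a 4MB cap all three ops fit
```

In this model the strict 1MB cap returns an empty batch on every attempt, which
matches the repeated "Successfully read 1 ops from disk" messages with no
progress, while the 4MB cap corresponds to the set_flag workaround above.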
> If a single op is larger than consensus_max_batch_size_bytes, consensus gets
> stuck
> ----------------------------------------------------------------------------------
>
> Key: KUDU-1586
> URL: https://issues.apache.org/jira/browse/KUDU-1586
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.10.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Blocker
>
> I noticed on a cluster test that a leader was spinning with log messages like:
> I0829 14:17:31.870786 22184 log_cache.cc:307] T e7cacfdb22744496a6d5d66227a69823 P 5d15962d2f2445b1ba15b93ead4fb31b: Successfully read 1 ops from disk (866604..866604)
> I0829 14:17:31.873234 6186 log_cache.cc:307] T e7cacfdb22744496a6d5d66227a69823 P 5d15962d2f2445b1ba15b93ead4fb31b: Successfully read 1 ops from disk (866604..866604)
> I0829 14:17:31.875713 22184 log_cache.cc:307] T e7cacfdb22744496a6d5d66227a69823 P 5d15962d2f2445b1ba15b93ead4fb31b: Successfully read 1 ops from disk (866604..866604)
> I0829 14:17:31.878078 6186 log_cache.cc:307] T e7cacfdb22744496a6d5d66227a69823 P 5d15962d2f2445b1ba15b93ead4fb31b: Successfully read 1 ops from disk (866604..866604)
> After investigation, it seems this op was larger than 1MB (the default
> consensus batch size), which caused this tight loop with no progress.