[
https://issues.apache.org/jira/browse/KUDU-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Henke updated KUDU-3275:
------------------------------
Attachment: raft_consensus-itest.2.txt.gz
> RaftConsensusITest.TestSlowFollower is Flaky
> --------------------------------------------
>
> Key: KUDU-3275
> URL: https://issues.apache.org/jira/browse/KUDU-3275
> Project: Kudu
> Issue Type: Bug
> Affects Versions: 1.14.0
> Reporter: Grant Henke
> Priority: Major
> Attachments: raft_consensus-itest.2.txt.gz
>
>
> I have seen RaftConsensusITest.TestSlowFollower fail quite a few times with
> the following:
> {code:java}
> I0411 07:15:34.772778 3404 raft_consensus.cc:1200] T
> a82c506675e74d62a75e53acd05e86e8 P 8110dc17824943a68d3051e0df96ae5d [term 1
> FOLLOWER]: Deduplicated request from leader. Original:
> 1.18320->[1.18321-1.18746] Dedup: 1.18746->[]I0411 07:15:34.772778 3404
> raft_consensus.cc:1200] T a82c506675e74d62a75e53acd05e86e8 P
> 8110dc17824943a68d3051e0df96ae5d [term 1 FOLLOWER]: Deduplicated request from
> leader. Original: 1.18320->[1.18321-1.18746] Dedup: 1.18746->[]W0411
> 07:15:34.798368 3632 log.cc:502] T a82c506675e74d62a75e53acd05e86e8 P
> 8110dc17824943a68d3051e0df96ae5d: Injecting 1096ms of latency in
> SegmentAllocator::Sync()W0411 07:15:34.834003 3206 consensus_peers.cc:480] T
> a82c506675e74d62a75e53acd05e86e8 P 5698d4de566c4adbae64b6f234c0561d -> Peer
> 8110dc17824943a68d3051e0df96ae5d (127.28.159.66:42271): Couldn't send request
> to peer 8110dc17824943a68d3051e0df96ae5d. Status: Timed out: UpdateConsensus
> RPC to 127.28.159.66:42271 timed out after 0.050s (SENT). This is attempt 1:
> this message will repeat every 5th retry.W0411 07:15:35.333277 3366
> tablet_service.cc:2874] Error setting up scanner with request: Service
> unavailable: Timed out: could not wait for desired snapshot timestamp to be
> consistent: Tablet is lagging too much to be able to serve snapshot scan.
> Lagging by: 45366 ms, (max is 30000 ms):: new_scan_request { tablet_id:
> "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type:
> INT32 is_key: true is_nullable: false } projected_columns { name: "int_val"
> type: INT32 is_key: false is_nullable: false } projected_columns { name:
> "string_val" type: STRING is_key: false is_nullable: true } read_mode:
> READ_AT_SNAPSHOT propagated_timestamp: 6627841138429693952 cache_blocks: true
> order_mode: ORDERED row_format_flags: 0 authz_token { token_data:
> "\010\210\306\312\203\006\"3\n\005slave\022*\n
> 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature:
> "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'"
> signing_key_seq_num: 0 } } call_seq_id: 0W0411 07:15:35.635779 3366
> tablet_service.cc:2874] Error setting up scanner with request: Service
> unavailable: Timed out: could not wait for desired snapshot timestamp to be
> consistent: Tablet is lagging too much to be able to serve snapshot scan.
> Lagging by: 45668 ms, (max is 30000 ms):: new_scan_request { tablet_id:
> "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type:
> INT32 is_key: true is_nullable: false } projected_columns { name: "int_val"
> type: INT32 is_key: false is_nullable: false } projected_columns { name:
> "string_val" type: STRING is_key: false is_nullable: true } read_mode:
> READ_AT_SNAPSHOT propagated_timestamp: 6627841138425311232 cache_blocks: true
> order_mode: ORDERED row_format_flags: 0 authz_token { token_data:
> "\010\210\306\312\203\006\"3\n\005slave\022*\n
> 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature:
> "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'"
> signing_key_seq_num: 0 } } call_seq_id: 0W0411 07:15:35.636209 3626
> scanner-internal.cc:406] Time spent opening tablet: real 57.698s user 0.002s
> sys 0.000sF0411 07:15:35.636267 3626 test_workload.cc:255] Check failed:
> _s.ok() Bad status: Timed out: exceeded configured scan timeout of 60.000s:
> after 20 scan attempts: unable to retry before timeout: Remote error: Service
> unavailable: Timed out: could not wait for desired snapshot timestamp to be
> consistent: Timed out waiting for ts: P: 1618125277942117 usec, L: 0 to be
> safe (mode: NON-LEADER). Current safe time: P: 1618125277252906 usec, L: 0
> Physical time difference: 0.689s*** Check failure stack trace: ****** Aborted
> at 1618125335 (unix time) try "date -d @1618125335" if you are using GNU date
> ***PC: @ 0x7fe2e72bafb7 gsignal*** SIGABRT (@0x3e80000727d) received by
> PID 29309 (TID 0x7fe2d979a700) from PID 29309; stack trace: *** @
> 0x7fe2e97881f1 google::(anonymous namespace)::FailureSignalHandler() at ??:0
> @ 0x7fe2eb20e980 (unknown) at ??:0 @ 0x7fe2e72bafb7 gsignal at
> ??:0 @ 0x7fe2e72bc921 abort at ??:0 @ 0x7fe2e9778439
> google::logging_fail() at ??:0 @ 0x7fe2e977bebd
> google::LogMessage::Fail() at ??:0 @ 0x7fe2e977dace
> google::LogMessage::SendToLog() at ??:0 @ 0x7fe2e977ba40
> google::LogMessage::Flush() at ??:0 @ 0x7fe2e977e3b9
> google::LogMessageFatal::~LogMessageFatal() at ??:0 @ 0x7fe2efae81f0
> kudu::TestWorkload::ReadThread() at ??:0 @ 0x7fe2eaa838df
> execute_native_thread_routine at ??:0 @ 0x7fe2eb2036db start_thread at
> ??:0 @ 0x7fe2e739d71f clone at ??:0I0411 07:15:35.762290 4692
> consensus_queue.cc:778] T a82c506675e74d62a75e53acd05e86e8 P
> 5698d4de566c4adbae64b6f234c0561d [LEADER]: Peer
> 8110dc17824943a68d3051e0df96ae5d is lagging by at least 66546 ops behind the
> committed index [suppressed 8 similar messages]{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)