[ https://issues.apache.org/jira/browse/KUDU-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans resolved KUDU-1512.
--------------------------------------
Resolution: Duplicate
Fix Version/s: NA
Target Version/s: (was: 1.0.0)
Oh yeah, it's a dupe.
> Remote bootstrap always fails under heavy insert load
> -----------------------------------------------------
>
> Key: KUDU-1512
> URL: https://issues.apache.org/jira/browse/KUDU-1512
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Reporter: Jean-Daniel Cryans
> Priority: Blocker
> Fix For: NA
>
>
> I just noticed a case on a test cluster where, after remote bootstrapping a
> tablet, we lacked the logs needed to start replicating to it. Here's a bit
> of log:
> {noformat}
> I0701 17:07:23.387614 61379 consensus_peers.cc:296] T 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Sending request to remotely bootstrap
> ...
> I0701 17:12:15.867938 65505 log.cc:728] Deleting log segment in path: /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000217 (GCed ops < 256735)
> ... (the TS stopped GCing logs while the remote bootstrap was in progress)
> I0701 17:22:48.354138 413 remote_bootstrap_service.cc:242] Request end of remote bootstrap session d80a7427c7d040ac8d949d0cadb3e7c5-807ff8e42640482d8d947b693d56ce03 received from {real_user=kudu, eff_user=} at 10.20.130.116:48132
> I0701 17:22:48.494417 65505 log.cc:728] Deleting log segment in path: /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000218 (GCed ops < 276284)
> I0701 17:23:02.591763 7627 consensus_queue.cc:577] T 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b [LEADER]: Connected to new peer: Peer: d80a7427c7d040ac8d949d0cadb3e7c5, Is new: false, Last received: 21.256735, Next index: 256736, Last known committed idx: 256493, Last exchange result: ERROR, Needs remote bootstrap: false
> I0701 17:23:02.608044 7627 consensus_peers.cc:181] T 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Could not obtain request from queue for peer: d80a7427c7d040ac8d949d0cadb3e7c5. Status: Not found: Failed to read ops 256736..279156: Segment 218 which contained index 256736 has been GCed
> {noformat}
> So 9e59a4c24de44e3f9de219df865b4f3b spent about 16 minutes sending data to
> d80a7427c7d040ac8d949d0cadb3e7c5 while still receiving inserts. As soon as
> the new follower finished bootstrapping, we GC'd the logs we had been
> holding for it. After that, the leader dropped the new node from the config
> and started all over again... over and over. Eventually the other follower
> died for an unrelated reason and we never recovered the tablet.
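>
> To make the failure mode concrete, here's a minimal C++ sketch of the
> retention invariant the log GC would need to uphold (this is not Kudu's
> actual implementation; all names here are hypothetical): every tracked
> peer pins the WAL at the next index it must be sent, and a peer that just
> finished remote bootstrap has to keep pinning until its first successful
> exchange.
> {noformat}
> #include <algorithm>
> #include <cstdint>
> #include <limits>
> #include <map>
> #include <string>
>
> // Hypothetical sketch: tracks which op indexes must survive log GC.
> struct TrackedPeer {
>   int64_t next_index;  // first op index this peer still needs
> };
>
> class LogRetention {
>  public:
>   // A peer that just completed remote bootstrap must be (re)tracked here
>   // *before* its bootstrap session is closed, otherwise GC can race in
>   // between -- which is what the log above shows.
>   void Track(const std::string& uuid, int64_t next_index) {
>     peers_[uuid] = TrackedPeer{next_index};
>   }
>
>   // Ops strictly below the returned index are safe to GC; everything at
>   // or above it is still needed by at least one tracked peer.
>   int64_t MinRequiredIndex() const {
>     int64_t min_needed = std::numeric_limits<int64_t>::max();
>     for (const auto& entry : peers_) {
>       min_needed = std::min(min_needed, entry.second.next_index);
>     }
>     return min_needed;
>   }
>
>  private:
>   std::map<std::string, TrackedPeer> peers_;
> };
> {noformat}
> With the just-bootstrapped peer tracked at next_index = 256736 (the "Next
> index" in the consensus_queue line above), MinRequiredIndex() would return
> 256736, so wal-000000218, which still contained index 256736, could not be
> deleted. In the trace, the anchor was dropped when the session ended at
> 17:22:48, so segment 218 was GC'd about 14 seconds before the first
> exchange at 17:23:02 needed it.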