[ https://issues.apache.org/jira/browse/CASSANDRA-6156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800740#comment-13800740 ]
Jonathan Ellis commented on CASSANDRA-6156:
-------------------------------------------
[~yukim] is this worth keeping open? I note that "unable to fetch range" is
gone in 2.0.
> Poor resilience and recovery for bootstrapping node - "unable to fetch range"
> -----------------------------------------------------------------------------
>
> Key: CASSANDRA-6156
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6156
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Alyssa Kwan
> Fix For: 1.2.8
>
>
> We have an 8-node cluster on 1.2.8 using vnodes. One of our nodes failed and
> we are having lots of trouble bootstrapping it back. On each attempt,
> bootstrapping eventually fails with a RuntimeException "Unable to fetch
> range". As far as we can tell, long GC pauses on the sender side cause
> heartbeat drops or delays, which leads the gossip controller to convict the
> connection and mark the sender dead. We've done significant GC tuning to
> minimize the duration of pauses and raised phi_convict_threshold to its max.
> That merely lets the bootstrap process take longer to fail.
>
> The inability to reliably add nodes significantly affects our ability to
> scale.
>
> We're not the only ones:
> http://stackoverflow.com/questions/19199349/cassandra-bootstrap-fails-with-unable-to-fetch-range
>
> What can we do in the immediate term to bring this node in? And what's the
> long-term solution?
>
> One possible solution would be to allow bootstrapping to be an incremental
> process with individual transfers of vnode ownership instead of attempting to
> transfer the whole set of vnodes transactionally. (I assume that's what's
> happening now.) I don't know what would have to change on the gossip and
> token-aware client side to support this.
>
> Another solution would be to partition sstable files by vnode and allow
> transfer of those files directly, with some sort of checkpointing and
> incremental transfer of the writes that arrive after the sstable is
> transferred.
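
The description above attributes the failures to long GC pauses pushing the failure detector past its conviction threshold. A minimal sketch of that effect follows, assuming a phi accrual detector with exponentially distributed heartbeat intervals. The function names, the 1-second mean heartbeat interval, and the threshold of 8 are assumptions made for illustration; this is not Cassandra's actual implementation.

```python
import math


def phi(seconds_since_last_heartbeat, mean_interval_seconds):
    """Suspicion level: -log10 of the probability that a heartbeat is
    merely late, assuming exponentially distributed heartbeat intervals."""
    return seconds_since_last_heartbeat / (mean_interval_seconds * math.log(10))


def is_convicted(pause_seconds, mean_interval_seconds=1.0,
                 phi_convict_threshold=8.0):
    """True if a silence of pause_seconds pushes phi past the threshold.
    The default interval and threshold are illustrative assumptions."""
    return phi(pause_seconds, mean_interval_seconds) > phi_convict_threshold


def max_tolerated_pause(mean_interval_seconds, phi_convict_threshold):
    """Longest silence (in seconds) that does not yet exceed the threshold."""
    return phi_convict_threshold * mean_interval_seconds * math.log(10)


if __name__ == "__main__":
    for threshold in (8.0, 16.0):
        limit = max_tolerated_pause(1.0, threshold)
        print(f"threshold {threshold}: tolerated pause {limit:.1f}s")
```

Under these assumptions the tolerated pause grows only linearly with the threshold. That would be consistent with the report that raising the threshold delays the failure rather than preventing it, as long as some pause on the sender eventually exceeds the limit.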