[
https://issues.apache.org/jira/browse/CASSANDRA-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Ellis resolved CASSANDRA-4288.
---------------------------------------
Resolution: Fixed
Changed to info and committed.
> prevent thrift server from starting before gossip has settled
> -------------------------------------------------------------
>
> Key: CASSANDRA-4288
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4288
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Peter Schuller
> Assignee: Chris Burroughs
> Fix For: 2.0.5
>
> Attachments: CASSANDRA-4288-trunk.txt, j4288-1.2-v1-txt,
> j4288-1.2-v2-txt, j4288-1.2-v3.txt
>
>
> A serious problem is that there is no coordination whatsoever between gossip
> and the consumers of gossip. In particular, on a large cluster with hundreds
> of nodes, it takes several seconds for gossip to settle because the gossip
> stage is CPU bound. This leads to a node starting up and accepting thrift
> traffic long before it has any clue of which nodes are up and down. This
> causes client-visible timeouts (for nodes that are down but not yet
> identified as such) and UnavailableExceptions (for nodes that are up but not
> yet identified as such). This is bad in general, but especially so for
> clients doing non-idempotent writes (counter increments).
> I was going to fix this as part of more significant rewriting in other
> tickets having to do with gossip/topology/etc, but that's not going to
> happen. So, the attached patch is roughly what we're running with in
> production now to make restarts bearable. The minimum wait time serves two
> purposes: it ensures that gossip has time to become CPU bound if it is going
> to, and it is large enough that down nodes can be identified as such in most
> typical cases with the default phi conviction threshold (untested; we
> actually ran with a smaller minimum of 5 seconds, but from past experience I
> believe 15 seconds is enough).
> The patch is tested on our 1.1 branch. It applies on trunk, and the diff is
> against trunk, but I have not tested it against trunk.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)