[
https://issues.apache.org/jira/browse/CASSANDRA-12966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706147#comment-15706147
]
Jason Brown commented on CASSANDRA-12966:
-----------------------------------------
I've chosen to solve this by updating {{SystemKeyspace}} in the following ways:
- remove the {{synchronized}} keyword from the functions that update the
{{peers}} table
- on those same functions, execute the update to the {{peers}} table
asynchronously, on a mutation stage thread, instead of blocking the current
thread (which is often the Gossip stage thread)
- return a {{Future}} from those functions so that tests can behave properly
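As a rough illustration of the three changes above (not the actual patch), a minimal Java sketch might look like the following. The class name {{AsyncPeerUpdateSketch}}, the {{MUTATION_STAGE}} executor, and the in-memory {{PEERS}} map are hypothetical stand-ins for the real stage and table; only the overall shape (no {{synchronized}}, submit to another thread, return a {{Future}}) comes from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class AsyncPeerUpdateSketch
{
    // Hypothetical stand-in for the mutation stage; not Cassandra's actual executor.
    // Daemon threads so the sketch never keeps the JVM alive.
    private static final ExecutorService MUTATION_STAGE =
        Executors.newFixedThreadPool(4, r -> {
            Thread t = new Thread(r, "mutation-stage-sketch");
            t.setDaemon(true);
            return t;
        });

    // Hypothetical stand-in for the peers table: peer address -> (column -> value).
    static final Map<String, Map<String, String>> PEERS = new ConcurrentHashMap<>();

    // No synchronized keyword: the caller only enqueues the write and
    // immediately gets a Future back, so it never waits on the write itself.
    static Future<?> updatePeerInfo(String peer, String column, String value)
    {
        return MUTATION_STAGE.submit(() -> writeToPeers(peer, column, value));
    }

    // In the real system this is where the write would wait on the commit log;
    // here it is just an in-memory map update.
    private static void writeToPeers(String peer, String column, String value)
    {
        PEERS.computeIfAbsent(peer, k -> new ConcurrentHashMap<>()).put(column, value);
    }

    public static void main(String[] args) throws Exception
    {
        Future<?> update = updatePeerInfo("127.0.0.2", "rack", "rack1");
        // A test can block on the Future to know the write has been applied.
        update.get(5, TimeUnit.SECONDS);
        System.out.println(PEERS.get("127.0.0.2").get("rack"));
    }
}
```

In this sketch the calling thread (the Gossip stage thread, in the scenario above) returns as soon as the task is submitted, while a test can call {{get()}} on the returned {{Future}} before asserting on the table contents.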
||3.0||3.X||trunk||
|[branch|https://github.com/jasobrown/cassandra/tree/12966_3.0]|[branch|https://github.com/jasobrown/cassandra/tree/12966_3.X]|[branch|https://github.com/jasobrown/cassandra/tree/12966_trunk]|
|[dtest|http://cassci.datastax.com/view/Dev/view/jasobrown/job/jasobrown-12966_3.0-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/jasobrown/job/jasobrown-12966_3.X-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/jasobrown/job/jasobrown-12966_trunk-dtest/]|
|[testall|http://cassci.datastax.com/view/Dev/view/jasobrown/job/jasobrown-12966_3.0-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/jasobrown/job/jasobrown-12966_3.X-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/jasobrown/job/jasobrown-12966_trunk-testall/]|
> Gossip thread slows down when using batch commit log
> ----------------------------------------------------
>
> Key: CASSANDRA-12966
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12966
> Project: Cassandra
> Issue Type: Bug
> Reporter: Jason Brown
> Assignee: Jason Brown
> Priority: Minor
>
> When using batch commit log mode, the Gossip thread slows down when updating
> peers after a node bounces. This is because we perform a bunch of updates to the
> peers table via {{SystemKeyspace.updatePeerInfo}}, which is a synchronized
> method. How long each of those individual updates takes depends on how
> busy the system is at the time with respect to write traffic. If the system is
> largely quiescent, each update will be relatively quick (just waiting for the fsync).
> If the system is getting a lot of writes, and depending on the
> commitlog_sync_batch_window_in_ms, each of the Gossip thread's updates can
> get stuck in the backlog, which causes the Gossip thread to stop processing.
> We have observed in large clusters that a rolling restart triggers and
> exacerbates this behavior.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)