[
https://issues.apache.org/jira/browse/CASSANDRA-13095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Lerer updated CASSANDRA-13095:
---------------------------------------
Assignee: Danil Smirnov
Status: Patch Available (was: Open)
> Timeouts between nodes
> ----------------------
>
> Key: CASSANDRA-13095
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13095
> Project: Cassandra
> Issue Type: Bug
> Reporter: Danil Smirnov
> Assignee: Danil Smirnov
> Priority: Minor
> Attachments: 13095-2.1.patch
>
>
> Recently I ran into a problem on a heavily loaded cluster where messages
> between certain nodes sometimes become blocked for no apparent reason.
> It looks like the same situation that is described here
> https://issues.apache.org/jira/browse/CASSANDRA-12676?focusedCommentId=15736166&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736166
> A thread dump showed an infinite loop here:
> https://github.com/apache/cassandra/blob/a8a43dd32eb92406d7d8b105e08c68b3d5c7df49/src/java/org/apache/cassandra/utils/CoalescingStrategies.java#L109
> Apparently the problem is in the initial value of the epoch field in the
> TimeHorizonMovingAverageCoalescingStrategy class. When its value is not
> evenly divisible by BUCKET_INTERVAL, ix(epoch-1) does not point to the
> correct bucket. As a result, sum gradually increases and, upon reaching
> MEASURED_INTERVAL, averageGap becomes 0 and the thread blocks.
> It's hard to reproduce because it takes a long time for sum to grow, and when
> no messages are sent for some time, sum becomes 0
> https://github.com/apache/cassandra/blob/a8a43dd32eb92406d7d8b105e08c68b3d5c7df49/src/java/org/apache/cassandra/utils/CoalescingStrategies.java#L301
> and the bug is no longer reproducible (until the connection between nodes is
> re-created).
> I've attached a patch which should fix the problem. I don't know if it will be
> of any help, since CASSANDRA-12676 will apparently disable this behaviour. One
> note about the performance regressions, though: there is a small chance they
> are a result of the bug described here, so it might be worth testing
> performance after the fixes and/or tuning the algorithm.
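
The bucket-index mismatch described in the report can be illustrated with a
small standalone sketch. It is not the Cassandra code and not the attached
patch: the class name BucketIndexSketch, the helper align(), and the constant
values (a 2^26 ns bucket, 16 buckets) are assumptions chosen for illustration,
and ix() and averageGap() are simplified stand-ins for the methods named in
the report. It shows that ix(epoch-1) only lands on the bucket preceding
ix(epoch) when epoch is a multiple of BUCKET_INTERVAL, and that an integer
division of MEASURED_INTERVAL by a larger sum yields 0.

```java
final class BucketIndexSketch {
    // Assumed values for illustration only; check CoalescingStrategies.java for the real ones.
    static final int INDEX_SHIFT = 26;
    static final long BUCKET_INTERVAL = 1L << INDEX_SHIFT;
    static final int BUCKET_COUNT = 16;
    static final long MEASURED_INTERVAL = BUCKET_INTERVAL * (BUCKET_COUNT - 1);

    // Maps a timestamp in nanoseconds to one of BUCKET_COUNT ring-buffer slots.
    static int ix(long nanos) {
        return (int) ((nanos >>> INDEX_SHIFT) & (BUCKET_COUNT - 1));
    }

    // Hypothetical helper: rounds a timestamp down to a multiple of BUCKET_INTERVAL.
    static long align(long nanos) {
        return nanos & ~(BUCKET_INTERVAL - 1);
    }

    // Simplified stand-in: integer division, so the result is 0 once sum exceeds MEASURED_INTERVAL.
    static long averageGap(long sum) {
        return sum == 0 ? Integer.MAX_VALUE : MEASURED_INTERVAL / sum;
    }

    public static void main(String[] args) {
        long aligned = 5 * BUCKET_INTERVAL;
        long unaligned = aligned + 12345;

        // Aligned epoch: epoch-1 falls into the previous bucket.
        System.out.println(ix(aligned) + " vs " + ix(aligned - 1));
        // Unaligned epoch: epoch-1 falls into the same bucket as epoch.
        System.out.println(ix(unaligned) + " vs " + ix(unaligned - 1));
        // Aligning the initial value restores the expected relationship.
        System.out.println(ix(align(unaligned)) + " vs " + ix(align(unaligned) - 1));
        // A sum larger than MEASURED_INTERVAL drives the average gap to 0.
        System.out.println(averageGap(MEASURED_INTERVAL + 1));
    }
}
```

Rounding the initial epoch down to a bucket boundary, as align() does here, is
one plausible way to keep the two indices distinct; whether 13095-2.1.patch
takes this approach is not stated in the report.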
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)