[
https://issues.apache.org/jira/browse/CASSANDRA-10730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034423#comment-15034423
]
Jim Witschey commented on CASSANDRA-10730:
------------------------------------------
I've run the dtests in a couple of diagnostic ways here. First: since last
Wednesday, I've been running the normal cassandra-3.0 dtest job on m3.2xlarge
instances instead of xlarge instances:
http://cassci.datastax.com/view/cassandra-3.0/job/cassandra-3.0_dtest/jobConfigHistory/showDiffFiles?timestamp1=2015-11-16_17-20-24&timestamp2=2015-11-25_19-20-58
Since then, I haven't seen any connection timeouts on that job. There's no
guarantee that this will continue to hold, but going from 4 vCPUs/15 GiB to 8
vCPUs/30 GiB has prevented timeouts so far.
I've also got a custom dtest branch that prints debug information when an
attempt to connect times out. The branch is here:
https://github.com/mambocab/cassandra-dtest/tree/improve-timeout-debugging
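Roughly, the debugging hook works like this (a simplified sketch, not the exact
code on the branch; {{connect_fn}} and the {{node.pid}} lookup are placeholders
for however the test actually gets a session and finds the node's JVM process):
{code}
import subprocess
import traceback

def connect_with_debug(node, connect_fn, timeout=10):
    """Try to open a CQL connection; if it fails, dump diagnostics.

    Hypothetical helper: `connect_fn` stands in for whatever the test
    normally calls to get a session, and `node.pid` for however we look
    up the node's JVM process id.
    """
    try:
        return connect_fn(timeout=timeout)
    except Exception:
        traceback.print_exc()
        # Snapshot machine and JVM state so we can see why the CQL port
        # never became available.
        print(subprocess.check_output(['top', '-b', '-n', '1']).decode())
        print(subprocess.check_output(['netstat', '-an']).decode())
        if getattr(node, 'pid', None):
            print(subprocess.check_output(['jstack', str(node.pid)]).decode())
        raise
{code}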
And here's an example of the tests running and producing that debug output:
http://cassci.datastax.com/job/mambocab-cassandra-3.0-dtest/10/testReport/
This is one of the tests that times out:
http://cassci.datastax.com/job/mambocab-cassandra-3.0-dtest/10/testReport/thrift_tests/TestMutations/test_batch_mutate_remove_slice_of_entire_supercolumns/
Because of a bug in my debugging code, the output that indicates the timeout is
"local variable 'session' referenced before assignment" rather than the usual
timeout message. I believe the collected output is still useful.
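That message is Python's {{UnboundLocalError}}: the connection attempt raises
before {{session}} is ever assigned, and the code then goes on to use it. A
simplified sketch of the shape of the bug ({{print_debug_info}} is a stand-in
for the real debug hook, not the actual branch code):
{code}
def get_session(cluster):
    try:
        session = cluster.connect()  # raises OperationTimedOut, so
                                     # 'session' is never assigned
    except Exception:
        print_debug_info()           # stand-in for the real debug hook
    return session                   # UnboundLocalError: local variable
                                     # 'session' referenced before assignment
{code}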
One pattern I've found is that, after the timed-out connections, there's always
a Java process owned by the automaton user using ~100% CPU in the output of
{{top}}. I'm running more builds to confirm that this is Cassandra, but I'd be
surprised if it weren't.
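For the record, here's roughly how the busiest process can be identified when
this happens (a sketch only; the {{ps}} column parsing is an assumption about
Linux procps output, not code from the branch):
{code}
import subprocess

def busiest_java_process():
    """Return (pid, %cpu, user) for the Java process using the most CPU.

    Sketch only: assumes Linux procps `ps` with these columns available.
    """
    out = subprocess.check_output(
        ['ps', '-eo', 'pid,pcpu,user,comm', '--sort=-pcpu']).decode()
    for line in out.splitlines()[1:]:
        pid, pcpu, user, comm = line.split(None, 3)
        if comm.strip() == 'java':
            return int(pid), float(pcpu), user
    return None

# Comparing that pid against the pids ccm reports for the cluster's nodes
# would confirm whether the hot JVM really is a Cassandra process.
{code}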
Is this information -- info about the patterns that the failures follow, and
the {{jstack}}, {{netstat}}, and {{top}} output -- helpful? [~aweisberg] Do you
have any thoughts? I'm not sure what to make of it. If C* doesn't make the CQL
port available for minutes under certain circumstances -- like running with
15 GiB of memory -- that seems like a bug to me.
> periodic timeout errors in dtest
> --------------------------------
>
> Key: CASSANDRA-10730
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10730
> Project: Cassandra
> Issue Type: Bug
> Reporter: Jim Witschey
> Assignee: Jim Witschey
>
> Dtests often fail with connection timeout errors. For example:
> http://cassci.datastax.com/job/cassandra-3.1_dtest/lastCompletedBuild/testReport/upgrade_tests.cql_tests/TestCQLNodes3RF3/deletion_test/
> {code}
> ('Unable to connect to any servers', {'127.0.0.1':
> OperationTimedOut('errors=Timed out creating connection (10 seconds),
> last_host=None',)})
> {code}
> We've merged a PR to increase timeouts:
> https://github.com/riptano/cassandra-dtest/pull/663
> It doesn't look like this has improved things:
> http://cassci.datastax.com/view/cassandra-3.0/job/cassandra-3.0_dtest/363/testReport/
> Next steps here are
> * to scrape Jenkins history to see if and how the number of tests failing
> this way has increased (it feels like it has). From there we can bisect over
> the dtests, ccm, or C*, depending on what looks like the source of the
> problem.
> * to better instrument the dtest/ccm/C* startup process to see why the nodes
> start but don't successfully make the CQL port available.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)