[ 
https://issues.apache.org/jira/browse/CASSANDRA-10730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034423#comment-15034423
 ] 

Jim Witschey commented on CASSANDRA-10730:
------------------------------------------

I've run the dtests in a couple of diagnostic ways here. First, I've been 
running the normal cassandra-3.0 dtest job on m3.2xlarge instances instead of 
xlarge instances since last Wednesday:

http://cassci.datastax.com/view/cassandra-3.0/job/cassandra-3.0_dtest/jobConfigHistory/showDiffFiles?timestamp1=2015-11-16_17-20-24&timestamp2=2015-11-25_19-20-58

Since then, I haven't seen any connection timeouts on that job. There's no 
guarantee that this will continue to hold, but going from 4 vCPUs/15 GiB to 8 
vCPUs/30 GiB has prevented timeouts so far.

I've also got a custom dtest branch that prints debug information when an 
attempt to connect times out. The branch is here:

https://github.com/mambocab/cassandra-dtest/tree/improve-timeout-debugging

And here's an example of the tests running and producing that debug output:

http://cassci.datastax.com/job/mambocab-cassandra-3.0-dtest/10/testReport/

This is one of the tests that times out:

http://cassci.datastax.com/job/mambocab-cassandra-3.0-dtest/10/testReport/thrift_tests/TestMutations/test_batch_mutate_remove_slice_of_entire_supercolumns/

Because of a bug in my debugging code, the timeout in that test shows up as 
"local variable 'session' referenced before assignment" rather than the usual 
timeout output. I believe the output collected is still useful.
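
For illustration, that message is Python's UnboundLocalError, which appears 
when an exception handler swallows a failed connection attempt and the code 
then uses the variable that was never assigned. The sketch below is not the 
code from the branch; {{ConnectTimeout}}, {{connect_buggy}}, and 
{{connect_fixed}} are hypothetical names used only to show the pattern and one 
way to avoid it.

```python
class ConnectTimeout(Exception):
    """Hypothetical stand-in for the driver's connection timeout error."""


def connect_buggy(connect):
    """Log on timeout, then fall through and use an unassigned variable."""
    try:
        session = connect()
    except ConnectTimeout:
        print("connection timed out; collecting debug output")
    # If connect() raised, 'session' was never bound, so this line raises
    # UnboundLocalError and hides the original timeout.
    return session


def connect_fixed(connect):
    """Log on timeout, then re-raise so the real error is reported."""
    try:
        session = connect()
    except ConnectTimeout:
        print("connection timed out; collecting debug output")
        raise
    return session
```

Re-raising after collecting the debug output keeps the usual timeout error in 
the test report while still printing the extra diagnostics.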

One pattern I've found is that, after the timed-out connections, there's always 
a Java process owned by the automaton user using ~100% CPU in the output of 
{{top}}. I'm running more builds to confirm that this is Cassandra, but I'd be 
surprised if it weren't.
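
A check like this could be automated over the collected {{top}} output. The 
following is a rough sketch, assuming the output comes from batch mode 
({{top -b -n 1}}) with the default PID/USER/%CPU/COMMAND columns; the function 
name and the 90% threshold are my own choices for illustration. Some versions 
of {{top}} truncate long user names with a trailing "+", so the match allows 
for that.

```python
def busy_processes(top_output, user, command, min_cpu=90.0):
    """Return (pid, cpu) pairs for rows of top output that match the
    given user and command and use at least min_cpu percent CPU."""
    required = ("PID", "USER", "%CPU", "COMMAND")
    lines = top_output.splitlines()

    # Find the column header row; the summary lines above it are skipped.
    header_idx = None
    for i, line in enumerate(lines):
        if all(name in line.split() for name in required):
            header_idx = i
            break
    if header_idx is None:
        return []

    columns = lines[header_idx].split()
    pid_col, user_col, cpu_col, cmd_col = (columns.index(n) for n in required)

    def user_matches(field):
        # Accept an exact match or a truncated name such as "automat+".
        return field == user or (
            field.endswith("+") and user.startswith(field[:-1])
        )

    matches = []
    for line in lines[header_idx + 1:]:
        fields = line.split()
        if len(fields) <= max(pid_col, user_col, cpu_col, cmd_col):
            continue
        try:
            pid = int(fields[pid_col])
            cpu = float(fields[cpu_col].replace(",", "."))
        except ValueError:
            continue  # not a process row
        if (user_matches(fields[user_col])
                and fields[cmd_col] == command
                and cpu >= min_cpu):
            matches.append((pid, cpu))
    return matches
```

Calling {{busy_processes(output, "automaton", "java")}} on each captured 
snapshot would list the candidate processes, which could then be matched 
against the PIDs of the Cassandra nodes.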

Is this information -- the patterns that the failures follow, and the 
{{jstack}}, {{netstat}}, and {{top}} output -- helpful? [~aweisberg] Do you 
have any thoughts? I'm not sure what to make of it. If C* doesn't make the CQL 
port available for minutes under certain circumstances -- like running with 
15 GiB of memory -- that seems like a bug to me.
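
To put a number on "for minutes", the delay could be measured directly by 
polling the port after a node starts. This is only a sketch, not part of the 
dtest branch: the function name, the 120-second deadline, and the polling 
interval are assumptions, and 9042 is the default CQL native transport port. 
A successful TCP connect shows only that the port is accepting connections, 
not that the node can serve queries.

```python
import socket
import time


def seconds_until_port_open(host="127.0.0.1", port=9042,
                            deadline=120.0, interval=0.5):
    """Poll a TCP port until it accepts a connection.

    Returns the elapsed seconds at the first successful connect, or None
    if the deadline passes first.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline:
        try:
            # create_connection raises OSError (including timeouts) on failure.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            time.sleep(interval)
    return None
```

Logging this value per node alongside the instance type would show whether 
the smaller instances are consistently slower to open the port.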

> periodic timeout errors in dtest
> --------------------------------
>
>                 Key: CASSANDRA-10730
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10730
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jim Witschey
>            Assignee: Jim Witschey
>
> Dtests often fail with connection timeout errors. For example:
> http://cassci.datastax.com/job/cassandra-3.1_dtest/lastCompletedBuild/testReport/upgrade_tests.cql_tests/TestCQLNodes3RF3/deletion_test/
> {code}
> ('Unable to connect to any servers', {'127.0.0.1': 
> OperationTimedOut('errors=Timed out creating connection (10 seconds), 
> last_host=None',)})
> {code}
> We've merged a PR to increase timeouts:
> https://github.com/riptano/cassandra-dtest/pull/663
> It doesn't look like this has improved things:
> http://cassci.datastax.com/view/cassandra-3.0/job/cassandra-3.0_dtest/363/testReport/
> Next steps here are
> * to scrape Jenkins history to see if and how the number of tests failing 
> this way has increased (it feels like it has). From there we can bisect over 
> the dtests, ccm, or C*, depending on what looks like the source of the 
> problem.
> * to better instrument the dtest/ccm/C* startup process to see why the nodes 
> start but don't successfully make the CQL port available.


