[ 
https://issues.apache.org/jira/browse/CASSANDRA-10730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034475#comment-15034475
 ] 

Ariel Weisberg commented on CASSANDRA-10730:
--------------------------------------------

It definitely sounds like a bug in the server.

I looked at all the threads and none of them look like they are actually doing 
anything that would explain the CPU utilization. What size heap are the nodes 
started with during dtests? The RSS of the Java process is 713 megabytes, which 
makes me wonder if it's caught spinning in GC against a 512 megabyte heap.
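
If it is GC, it should be visible directly on the stuck node. A minimal check, 
assuming <pid> is the node's JVM pid (e.g. from pgrep -f CassandraDaemon):

{code}
# Heap settings the node actually ended up with
jcmd <pid> VM.flags | tr ' ' '\n' | grep -iE 'MaxHeapSize|Xmx'

# GC activity: FGC/FGCT climbing while O sits near 100% means it's stuck in full GCs
jstat -gcutil <pid> 1000 10
{code}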

The thing to get then is a heap dump with jmap, GC logs, or a flight recording. 
A flight recording would show what the active threads are doing if we are wrong 
about it being a GC issue. I think it has to be GC because the socket is bound 
and the thread is listening; it probably just can't run because the JVM is wedged.
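
For reference, a rough sketch of all three (file paths and the 60 second 
duration are just placeholders; on Oracle JDK 8 the flight recorder also needs 
-XX:+UnlockCommercialFeatures -XX:+FlightRecorder on the node's command line):

{code}
# Heap dump
jmap -dump:live,format=b,file=/tmp/node1-heap.hprof <pid>

# GC logging: add to the node's JVM options (e.g. via cassandra-env.sh)
-Xloggc:/tmp/node1-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime

# Flight recording
jcmd <pid> JFR.start duration=60s filename=/tmp/node1.jfr
{code}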

What's strange is that you say it goes away with a bigger instance. Maybe more 
memory leads to a bigger default heap size from the JVM if we aren't specifying 
one?
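
That part should be easy to check: the JVM's ergonomic default maximum heap is 
roughly 1/4 of physical memory, so comparing the two instance sizes (with the 
same java the nodes run) would confirm or rule it out:

{code}
# Max heap the JVM would pick by default on this instance if -Xmx isn't set
java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -i 'MaxHeapSize'
{code}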

> periodic timeout errors in dtest
> --------------------------------
>
>                 Key: CASSANDRA-10730
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10730
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jim Witschey
>            Assignee: Jim Witschey
>
> Dtests often fail with connection timeout errors. For example:
> http://cassci.datastax.com/job/cassandra-3.1_dtest/lastCompletedBuild/testReport/upgrade_tests.cql_tests/TestCQLNodes3RF3/deletion_test/
> {code}
> ('Unable to connect to any servers', {'127.0.0.1': 
> OperationTimedOut('errors=Timed out creating connection (10 seconds), 
> last_host=None',)})
> {code}
> We've merged a PR to increase timeouts:
> https://github.com/riptano/cassandra-dtest/pull/663
> It doesn't look like this has improved things:
> http://cassci.datastax.com/view/cassandra-3.0/job/cassandra-3.0_dtest/363/testReport/
> Next steps here are
> * to scrape Jenkins history to see if and how the number of tests failing 
> this way has increased (it feels like it has). From there we can bisect over 
> the dtests, ccm, or C*, depending on what looks like the source of the 
> problem.
> * to better instrument the dtest/ccm/C* startup process to see why the nodes 
> start but don't successfully make the CQL port available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
