[ 
https://issues.apache.org/jira/browse/CASSANDRA-10730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036368#comment-15036368
 ] 

Ariel Weisberg commented on CASSANDRA-10730:
--------------------------------------------

It seems like the netstat output is not there anymore? I looked at the server
log from one of the failures, and it shows the server had started listening on
the socket.

A flight recording might give more visibility into whatever the point-in-time
snapshots are missing.

I am starting to wonder if this isn't some kind of client library or protocol 
issue. I kind of want to dig into what the client library is experiencing when 
it says it can't connect to the server.

The first quick debug step would be to connect to the port, write some
garbage, and see if you get a protocol error back from the server. If you
connect and get a protocol error back, it means that the server 90% works. I
looked at the code and it tries to do that. I wouldn't parse the error; I would
just consume response data until the socket closes, with a timeout of a few
minutes.
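
Something along these lines could serve as that probe. This is only a rough
sketch, assuming Python 3; the function name probe_port, the junk payload, and
the 180-second default are placeholders I made up, not anything taken from the
dtests or the driver. It does not parse the reply. It just collects bytes until
the server closes the socket or the deadline passes.

```python
import socket
import time


def probe_port(host, port, payload=b"not a valid CQL frame\r\n", timeout=180.0):
    """Connect, send junk bytes, then read until the peer closes or we time out.

    Hypothetical helper for illustration only.
    Returns (response_bytes, closed_by_peer).
    """
    deadline = time.monotonic() + timeout
    chunks = []
    closed_by_peer = False
    # create_connection raises if the connect itself fails or times out,
    # which is already a useful signal that the port is not reachable.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # overall deadline reached, server never closed
            sock.settimeout(remaining)
            try:
                data = sock.recv(4096)
            except socket.timeout:
                break  # nothing more arrived before the deadline
            except ConnectionResetError:
                closed_by_peer = True  # server dropped the connection abruptly
                break
            if not data:
                closed_by_peer = True  # orderly close from the server
                break
            chunks.append(data)
    return b"".join(chunks), closed_by_peer
```

Calling probe_port("127.0.0.1", 9042) would target the default CQL native
transport port on the address from the failing test; adjust if the node is
configured differently. Any non-empty response followed by a close suggests the
server side is accepting connections and answering, which would point the
investigation back toward the client.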

There is still the important clue of the CPU utilization and the fact that this 
goes away when you move to a bigger instance. A bigger instance means more CPU 
and more memory. But we have visibility into CPU and memory and nothing seems 
particularly wrong. There should be a smoking gun here but we aren't seeing it.

I did notice that CPU utilization isn't always reported by top, but top isn't a
great way to monitor CPU utilization.

> periodic timeout errors in dtest
> --------------------------------
>
>                 Key: CASSANDRA-10730
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10730
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jim Witschey
>            Assignee: Jim Witschey
>
> Dtests often fail with connection timeout errors. For example:
> http://cassci.datastax.com/job/cassandra-3.1_dtest/lastCompletedBuild/testReport/upgrade_tests.cql_tests/TestCQLNodes3RF3/deletion_test/
> {code}
> ('Unable to connect to any servers', {'127.0.0.1': 
> OperationTimedOut('errors=Timed out creating connection (10 seconds), 
> last_host=None',)})
> {code}
> We've merged a PR to increase timeouts:
> https://github.com/riptano/cassandra-dtest/pull/663
> It doesn't look like this has improved things:
> http://cassci.datastax.com/view/cassandra-3.0/job/cassandra-3.0_dtest/363/testReport/
> Next steps here are
> * to scrape Jenkins history to see if and how the number of tests failing 
> this way has increased (it feels like it has). From there we can bisect over 
> the dtests, ccm, or C*, depending on what looks like the source of the 
> problem.
> * to better instrument the dtest/ccm/C* startup process to see why the nodes 
> start but don't successfully make the CQL port available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
