Hi everyone,

For some integration tests, we start up a CassandraDaemon in a separate process (using the Java 7 ProcessBuilder API). All of my integration tests run beautifully on my laptop, but one of them fails on our Jenkins cluster.
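For context, here is a minimal sketch of that kind of launcher, written so that it also records the child's exit code. It is not our actual test harness: the class name ChildProcessExit, the helpers runAndWait and describeExit, and the "java -version" command are placeholders made up for illustration. The interpretation of exit codes assumes the common Unix convention of 128 + signal number, which is what I would expect on Linux with a HotSpot JVM, but I have not verified it on our Jenkins slaves.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Cassandra or of our real test code.
class ChildProcessExit {

    /**
     * Interprets an exit code using the common Unix convention of
     * 128 + signal number (an assumption; it may differ by platform).
     */
    static String describeExit(int exitCode) {
        if (exitCode == 0) {
            return "clean exit (0)";
        }
        if (exitCode > 128 && exitCode <= 128 + 64) {
            int signal = exitCode - 128;
            switch (signal) {
                case 9:
                    return "killed by SIGKILL (9); shutdown hooks do not run";
                case 15:
                    return "terminated by SIGTERM (15); shutdown hooks run";
                default:
                    return "terminated by signal " + signal;
            }
        }
        return "exited with code " + exitCode;
    }

    /** Starts the command, appends its stdout and stderr to logFile, and waits for it to exit. */
    static int runAndWait(List<String> command, File logFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        builder.redirectOutput(ProcessBuilder.Redirect.appendTo(logFile));
        Process process = builder.start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        File log = File.createTempFile("child", ".log");
        // Placeholder command; a real test would launch CassandraDaemon here.
        int code = runAndWait(Arrays.asList("java", "-version"), log);
        System.out.println(describeExit(code) + ", output in " + log);
    }
}
```

If the exit code turned out to look like a SIGTERM, that would suggest something outside the JVM asked the process to stop, which would at least be consistent with the shutdown hook running in an orderly way.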
The failing integration test does around 10k writes to different rows and then 10k reads. After running some number of reads, the job dies with this error:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.10:58209 (com.datastax.driver.core.exceptions.DriverException: Timeout during read))

This error appears to have occurred because the Cassandra process has stopped. The logs for the Cassandra process show some warnings during batch writes (the batches are too big), then no activity for a few minutes (I assume this is because all of the read operations were proceeding smoothly), and then the following:

INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903 ThriftServer.java (line 141) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java (line 182) Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930 Gossiper.java (line 1279) Announcing shutdown
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930 MessagingService.java (line 683) Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931 MessagingService.java (line 923) MessagingService has terminated the accept() thread

Does anyone have any ideas about how to debug this? Looking around on Google, I found some threads suggesting that this could be caused by an OOM error (http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors). Wouldn't such an error be logged, however?

The test that fails is a test of our MapReduce Hadoop InputFormat, and as such it does some pretty big queries across multiple rows (over a range of partitioning key tokens). I believe the default fetch size is 5000 rows, and the values in the rows I am fetching are just simple strings, so I would not expect the amount of data in a single read to be too big.
FWIW, I don't see any log messages about garbage collection for at least 3 minutes before the process shuts down (and no GC messages at all after the test stops doing writes and starts doing reads).

I'd greatly appreciate any help before my team kills me for breaking our Jenkins build so consistently! :)

Best regards,
Clint