Hi everyone,

For some integration tests, we start up a CassandraDaemon in a
separate process (using the Java 7 ProcessBuilder API).  All of my
integration tests run beautifully on my laptop, but one of them fails
on our Jenkins cluster.
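For context, the child process is launched with something like the sketch below (not our actual harness; `java -version` stands in for the real CassandraDaemon command line, and the output-draining loop is there so the child can't block on a full pipe):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;

public class LaunchChild {
    public static void main(String[] args) throws Exception {
        // In the real tests this command invokes
        // org.apache.cassandra.service.CassandraDaemon with its classpath;
        // "java -version" is just a placeholder child process.
        ProcessBuilder pb = new ProcessBuilder(Arrays.asList("java", "-version"));
        pb.redirectErrorStream(true);  // merge stderr into stdout so nothing is lost

        Process child = pb.start();
        BufferedReader out = new BufferedReader(
                new InputStreamReader(child.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
            System.out.println("[child] " + line);  // drain so the child never blocks
        }
        int exit = child.waitFor();
        System.out.println("child exited with code " + exit);
    }
}
```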

The failing integration test does around 10k writes to different rows
and then 10k reads.  After running some number of reads, the job dies
with this error:

com.datastax.driver.core.exceptions.NoHostAvailableException: All
host(s) tried for query failed (tried: /127.0.0.10:58209
(com.datastax.driver.core.exceptions.DriverException: Timeout during
read))

This error appears to have occurred because the Cassandra process has
stopped.  The logs for the Cassandra process show some warnings during
batch writes (the batches are too big), no activity for a few minutes
(I assume this is because all of the read operations were proceeding
smoothly), and then look like the following:

INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,903
ThriftServer.java (line 141) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,920 Server.java
(line 182) Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:51,930
Gossiper.java (line 1279) Announcing shutdown
INFO [StorageServiceShutdownHook] 2014-08-05 19:14:53,930
MessagingService.java (line 683) Waiting for messaging service to
quiesce
INFO [ACCEPT-/127.0.0.10] 2014-08-05 19:14:53,931
MessagingService.java (line 923) MessagingService has terminated the
accept() thread
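One detail that may narrow things down (my inference, not something explicit in the logs): StorageServiceShutdownHook is a JVM shutdown hook, and shutdown hooks run only on an orderly exit (System.exit or SIGTERM), not after a SIGKILL or a kernel OOM kill. So the presence of those lines suggests the daemon was asked to stop rather than killed outright. A tiny demo of that behavior (Java-7-compatible):

```java
public class HookDemo {
    public static void main(String[] args) {
        // Shutdown hooks fire on System.exit or SIGTERM, but are skipped
        // entirely on SIGKILL (kill -9) or a kernel OOM kill.
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                System.out.println("shutdown hook ran");
            }
        });
        System.out.println("exiting normally");
        System.exit(0);  // hook fires here; after kill -9 it would not
    }
}
```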

Does anyone have any ideas about how to debug this?  Searching on
Google, I found some threads suggesting that this could result from an
OOM error
(http://stackoverflow.com/questions/23755040/cassandra-exits-with-no-errors).
Wouldn't such an error be logged, however?

The test that fails exercises our Hadoop MapReduce InputFormat, and as
such it does some pretty big queries across multiple rows (over a
range of partitioning key tokens).  I believe the default fetch size
is 5000 rows, and the values in the rows I am fetching are just simple
strings, so I would not think the amount of data in a single read
would be too big.
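To sanity-check that claim, here's my back-of-the-envelope math (the bytes-per-row figure is purely a guess for a small string row plus overhead, not a measurement):

```java
public class PageEstimate {
    public static void main(String[] args) {
        int fetchSize = 5000;         // default driver page size, in rows
        int approxBytesPerRow = 200;  // guessed size of one small string row
        long pageBytes = (long) fetchSize * approxBytesPerRow;
        // Roughly 1 MB per page, which should be well within limits
        System.out.println("approx page size: " + (pageBytes / 1024) + " KiB");
    }
}
```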

FWIW, I don't see any log messages about garbage collection for at
least 3 minutes before the process shuts down (and no GC messages after
the test stops doing writes and starts doing reads).
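In case it helps, my next step is to add GC and OOM diagnostics to the child JVM's command line, roughly as below (the flags are standard HotSpot options; the log and dump paths are placeholders, and the list would feed into the same ProcessBuilder launch we already use):

```java
import java.util.ArrayList;
import java.util.List;

public class DiagnosticFlags {
    public static void main(String[] args) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        // GC logging, so collections show up even if the process dies silently
        cmd.add("-verbose:gc");
        cmd.add("-XX:+PrintGCDetails");
        cmd.add("-XX:+PrintGCDateStamps");
        cmd.add("-Xloggc:/tmp/cassandra-gc.log");             // placeholder path
        // Dump the heap if the child really does OOM
        cmd.add("-XX:+HeapDumpOnOutOfMemoryError");
        cmd.add("-XX:HeapDumpPath=/tmp/cassandra-oom.hprof"); // placeholder path
        cmd.add("org.apache.cassandra.service.CassandraDaemon");
        System.out.println(cmd);
        // This list would then be passed to: new ProcessBuilder(cmd).start()
    }
}
```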

I'd greatly appreciate any help before my team kills me for breaking
our Jenkins build so consistently!  :)

Best regards,
Clint
