We have a monitoring service that runs on all of our Cassandra nodes and performs several query types to ensure the cluster is healthy. We run the queries at different consistency levels and alert if any of them fail. All of our query types consistently succeed apart from our range query at consistency level ALL, which recently started failing around once per hour across our nodes (we run nine at present). My first assumption would be network issues, but that seems unlikely given it has been failing consistently on random nodes for around a week.
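For reference, the check is roughly equivalent to the sketch below, written against the Ruby cassandra-driver gem. The host, keyspace and table names are placeholders; only the consistency level, the expected ~5k row count and the 60s timeout reflect our actual setup, and I may be wrong about exactly how the driver applies the timeout option:

    require 'cassandra'

    # Placeholder host and keyspace; 'timeout' here is the driver's
    # client-side request timeout in seconds.
    cluster = Cassandra.cluster(hosts: ['10.0.0.1'], timeout: 60)
    session = cluster.connect('monitoring')

    # Range query at consistency ALL; ~5k tiny rows fit in a single page
    # with the default page size, so no paging loop is needed here.
    rows = session.execute(
      'SELECT key FROM consistency_check',  # placeholder table name
      consistency: :all,
      timeout:     60
    )

    raise 'row count mismatch' unless rows.size == 5_000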
The error causing the failures is the following:

    #<Cassandra::Errors::NoHostsAvailable: All attempted hosts failed: 0.0.0.0 (Cassandra::Errors::ReadTimeoutError: Operation timed out - received only 5 responses.)>

A few things to bear in mind:

- We use a range query, in spite of the obvious issues with range queries, because we are doing a data consistency check. We want to make sure the range query returns the 5k (tiny) rows we expect in that column family. This has worked to date, so I'm not sure why it has become an issue now.
- The error states that the query timed out after receiving only 5 responses. I assume that with a query at consistency ALL it should be nine, given we have nine servers.
- We are using the Ruby client library for these health consistency checks and we are explicitly setting the timeout to 60s. I am not convinced that timeout is being passed to C* from the client; it may only be applied by the client itself.
- We are running Cassandra version 2.1.13.1218.

Looking in the logs of the server that accepted the query at the time of the failure, all I can see is the following:

    INFO [CqlSlowLog-Writer-thread-0] 2017-06-28 07:18:42,027 CqlSlowLogWriter.java:151 - Recording statements with duration of 3550 in slow log
    INFO [CqlSlowLog-Writer-thread-0] 2017-06-28 07:18:46,904 CqlSlowLogWriter.java:151 - Recording statements with duration of 10118 in slow log
    INFO [Service Thread] 2017-06-28 07:20:19,766 GCInspector.java:258 - G1 Young Generation GC in 224ms. G1 Eden Space: 3657433088 -> 0; G1 Old Gen: 3320840704 -> 3419331072; G1 Survivor Space: 373293056 -> 310378496;

Any help or advice on why we would be seeing these intermittent yet persistent failures would be appreciated.

--
Regards,
Matthew O'Riordan
CEO who codes
Ably - simply better realtime <https://www.ably.io/>