We have a monitoring service that runs on all of our Cassandra nodes and
performs different types of queries to ensure the cluster is healthy.  We use
different consistency levels for the queries and alert if any of them
fail.  All of our query types consistently succeed apart from our ALL range
query, which has recently started failing around once per hour across our
nodes (we run nine at present).  I would assume network issues, but that
seems unlikely given it has been failing consistently on random nodes for
around a week now.

The error causing the failures is the following:

#<Cassandra::Errors::NoHostsAvailable: All attempted hosts failed: 0.0.0.0
(Cassandra::Errors::ReadTimeoutError: Operation timed out - received only 5
responses.)>

A few things to bear in mind:

   - We use a range query, in spite of the obvious issues with range
   queries, because we are doing a data consistency check. We want to make
   sure the range query returns the 5k (tiny) rows we expect in that column
   family.  This has worked to date, so I'm not sure why it's now an issue.
   - The error states that there was a timeout after receiving only 5
   responses.  I assume that with an ALL query it should be nine, given we
   have nine servers.
   - We are using the Ruby client lib for these health consistency checks
   and we are explicitly setting the timeout to 60s (a simplified sketch of
   the check follows this list).  I am not convinced that timeout is being
   passed to C* by the client; I suspect it is only used by the client
   itself.
   - We are running Cassandra version: 2.1.13.1218.
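
For context, the check is roughly equivalent to the following (simplified;
the hosts, keyspace, table name and alert helper below are placeholders
rather than our real ones):

require 'cassandra'  # DataStax Ruby driver (cassandra-driver gem)

# Placeholder hosts/keyspace; the real check uses our own names.
cluster = Cassandra.cluster(
  hosts:   ['10.0.0.1', '10.0.0.2'],
  timeout: 60                      # client-side request timeout, in seconds
)
session = cluster.connect('health')

begin
  # Full range query over the column family at CL=ALL; we expect ~5k tiny rows.
  rows = session.execute(
    'SELECT key FROM consistency_check',
    consistency: :all,
    timeout:     60                # again, as far as I can tell, client-side
  ).to_a
  raise "expected 5000 rows, got #{rows.size}" unless rows.size == 5000
rescue Cassandra::Errors::NoHostsAvailable => e
  alert(e.message)                 # alert() is our own helper, not shown here
end

As far as I understand it (and I may be wrong here), the :timeout option
above is only enforced on the client; server-side, a range read in 2.1 is
still bounded by range_request_timeout_in_ms in cassandra.yaml (10000 ms by
default), which would be consistent with the 60s client timeout not making
any difference.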


Looking in the logs of the server that accepted the query at the time of
the failure, all I can see is the following:

INFO  [CqlSlowLog-Writer-thread-0] 2017-06-28 07:18:42,027
 CqlSlowLogWriter.java:151 - Recording statements with duration of 3550 in
slow log
INFO  [CqlSlowLog-Writer-thread-0] 2017-06-28 07:18:46,904
 CqlSlowLogWriter.java:151 - Recording statements with duration of 10118 in
slow log
INFO  [Service Thread] 2017-06-28 07:20:19,766  GCInspector.java:258 - G1
Young Generation GC in 224ms.  G1 Eden Space: 3657433088 -> 0; G1 Old Gen:
3320840704 -> 3419331072; G1 Survivor Space: 373293056 -> 310378496;


Any help or advice on why we would be seeing these intermittent yet
persistent failures would be appreciated.

-- 

Regards,

Matthew O'Riordan
CEO who codes
Ably - simply better realtime <https://www.ably.io/>

*Ably News: Ably push notifications have gone live
<https://blog.ably.io/ably-push-notifications-are-now-available-64cb8ae37e74>*
