Re: ALL range query monitors failing frequently

2017-06-29 Thread Matthew O'Riordan
Thanks Kurt, I appreciate that feedback.

I’ll investigate the metrics more fully and come back with my findings.

As for logs, I did look through the logs on each of the nodes and found
nothing, I’m afraid.

On Wed, Jun 28, 2017 at 11:33 PM, kurt greaves  wrote:

> I'd say that no, a range query probably isn't the best for monitoring, but
> it really depends on how important it is that the range you select is
> consistent.
>
> From those traces it does seem that the bulk of the time was spent waiting
> for responses from the replicas, which may indicate a network issue, but
> it's not conclusive evidence.
>
> For SSTables you could check the SSTables per read of the query, but it's
> unnecessary as the traces indicate that's not the issue. It might be worth
> trying to debug potential network issues, and worth looking into metrics
> like CoordinatorReadLatency and CoordinatorScanLatency at the table level:
> https://cassandra.apache.org/doc/latest/operating/metrics.html#table-metrics
> Network traffic metrics between nodes, if you have them, would also be a
> good place to look.
>
> Other than that I'd look in the logs on each node when you run the trace
> and try to identify any errors that could be causing problems.
>



-- 

Regards,

Matthew O'Riordan
CEO who codes
Ably - simply better realtime 

*Ably News: Ably push notifications have gone live*


Re: ALL range query monitors failing frequently

2017-06-28 Thread kurt greaves
I'd say that no, a range query probably isn't the best for monitoring, but
it really depends on how important it is that the range you select is
consistent.

From those traces it does seem that the bulk of the time was spent waiting
for responses from the replicas, which may indicate a network issue, but
it's not conclusive evidence.

For SSTables you could check the SSTables per read of the query, but it's
unnecessary as the traces indicate that's not the issue. It might be worth
trying to debug potential network issues, and worth looking into metrics
like CoordinatorReadLatency and CoordinatorScanLatency at the table level:
https://cassandra.apache.org/doc/latest/operating/metrics.html#table-metrics
Network traffic metrics between nodes, if you have them, would also be a
good place to look.
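
A quick way to eyeball those numbers is nodetool (keyspace/table names below
are placeholders, and the command names assume 3.x, where cfstats and
cfhistograms were renamed):

    # Percentiles for read latency and SSTables touched per read
    nodetool tablehistograms my_keyspace my_table

    # Per-table stats: SSTable count, local read latency, partition sizes
    nodetool tablestats my_keyspace.my_table

    # Inter-node messaging pools (pending/dropped messages hint at network trouble)
    nodetool netstats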

Other than that I'd look in the logs on each node when you run the trace
and try to identify any errors that could be causing problems.
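
For example (the log location is an assumption; /var/log/cassandra is the
usual default for package installs):

    # On each node, around the time you ran the trace
    grep -E 'ERROR|WARN' /var/log/cassandra/system.log | tail -n 100

    # Dropped internode messages are also worth ruling out
    nodetool tpstats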


Re: ALL range query monitors failing frequently

2017-06-28 Thread Matthew O'Riordan
Hi Kurt

Thanks for the response. A few comments inline:

On Wed, Jun 28, 2017 at 1:17 PM, kurt greaves  wrote:

> You're correct in that the timeout is only driver side. The server will
> have its own timeouts configured in the cassandra.yaml file.
>
Yup, OK.

> I suspect either that you have a node down in your cluster (or 4),
Nope, that’s not what is happening: a) we have monitoring on all nodes, and
b) there is nothing in the logs.


> or your queries are gradually getting slower.
>
Perhaps, but we have query time metrics that don’t seem to indicate any
obvious issues.  See the attached metrics from the last 12 hours for quorum
queries.


> This kind of aligns with the slow query statements in your logs. Are you
> making changes/updates to the partitions that you are querying?
>
No


> It could be that the partitions are now spread across multiple SSTables
> and thus slowing things down. You should perform a trace to get a better
> idea of the issue.
>
If I run the range query at CONSISTENCY QUORUM or ALL, it is visibly slow in
cqlsh and unfortunately results in a trace failure: “Statement trace did not
complete within 10 seconds”.
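
As a fallback the trace rows can still be read directly from the
system_traces tables after cqlsh gives up waiting (and raising max_trace_wait
in cqlshrc is presumably another option). Something like:

    -- Most recent trace sessions (duration is in microseconds)
    SELECT session_id, coordinator, duration, started_at
    FROM system_traces.sessions LIMIT 10;

    -- Events for one session (the session id below is a placeholder)
    SELECT activity, source, source_elapsed
    FROM system_traces.events
    WHERE session_id = 11111111-2222-3333-4444-555555555555;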

> A hacky workaround would be to increase your read timeouts server side
> (read_request_timeout_in_ms), however this will mask underlying data model
> issues.
>
Yup, I certainly don’t like the idea of that.

I’m interested in what you said about the partitions being spread across
multiple SSTables.  Any pointers on what to look for there?

I then wondered whether a range query is really just not a good idea, even
if only for monitoring purposes. I tried querying for just one row with the
ID specified, i.e. something like SELECT * FROM keyspace.table WHERE id =
123; It was still incredibly slow (with CONSISTENCY ALL) and failed a few
times to generate a trace, but finally resulted in a trace that can be seen
at
https://gist.github.com/mattheworiordan/b1133008bf6fd14bfe6937a0004c8789#file-cassandra-trace-log
.
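
For reference, the cqlsh session was roughly the following (keyspace, table
and id are placeholders, as above):

    CONSISTENCY ALL;
    TRACING ON;
    SELECT * FROM keyspace.table WHERE id = 123;
    -- the trace is printed after the result rows when it completes in time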

The worst offender seemed to be 34.207.246.175, so I ran the same query on
that instance itself to see whether it is under load or servicing requests
slowly, and it’s not. See
https://gist.github.com/mattheworiordan/b1133008bf6fd14bfe6937a0004c8789#file-local-cassandra-trace-log
.

So as far as I can tell, there may be some issue with the nodes communicating
with each other, but the logs don’t reveal much. Where to now?
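
For what it’s worth, the basic inter-node checks I plan to run next (just a
sketch; the IP is the slow node from the trace above):

    # Ring view and schema agreement as seen from each node
    nodetool status
    nodetool describecluster

    # Raw connectivity/latency to the slow replica
    ping -c 5 34.207.246.175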

-- 

Regards,

Matthew O'Riordan
CEO who codes
Ably - simply better realtime 

*Ably News: Ably push notifications have gone live*


Re: ALL range query monitors failing frequently

2017-06-28 Thread kurt greaves
You're correct in that the timeout is only driver side. The server will
have its own timeouts configured in the cassandra.yaml file.

I suspect either that you have a node down in your cluster (or 4), or your
queries are gradually getting slower. This kind of aligns with the slow
query statements in your logs. Are you making changes/updates to the
partitions that you are querying? It could be that the partitions are now
spread across multiple SSTables and thus slowing things down. You should
perform a trace to get a better idea of the issue.

A hacky workaround would be to increase your read timeouts server side
(read_request_timeout_in_ms), however this will mask underlying data model
issues.
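
For completeness, the relevant cassandra.yaml settings (values shown are the
usual defaults; note that range scans have their own timeout, separate from
single-partition reads):

    # cassandra.yaml -- server-side request timeouts
    read_request_timeout_in_ms: 5000     # single-partition reads
    range_request_timeout_in_ms: 10000   # range scans
    request_timeout_in_ms: 10000         # default for other operations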