Re: tracing improvements

2017-01-27 Thread Nate McCall
I do miss this from other RDBMSs. If you could come up with a
light-touch way to do this, I think a lot of people would be quite
happy about it.

On Wed, Jan 25, 2017 at 2:02 PM, Corentin Chary wrote:
> Not directly related, but to make (3) more useful it would also be
> great to be able to list currently executing queries. I've had
> multiple cases where read queries would use up all my slots and
> never finish, and it was quite painful to discover exactly what the
> query was (slow query logging doesn't help if the query never finishes).
>
>
> --
> Corentin Chary
> http://xf.iksaif.net


Re: tracing improvements

2017-01-27 Thread Nate McCall
I think all three of these have merit. Per-CF tracing would be the
most immediately useful (and likely least impactful).

For #3, I like the interface approach over exposing internal APIs. You
can sort of kind of do this with a custom QueryProcessor, but having
something specific to tracing would be nice.

Thanks for opening these!


Re: tracing improvements

2017-01-25 Thread Corentin Chary
Not directly related, but to make (3) more useful it would also be
great to be able to list currently executing queries. I've had
multiple cases where read queries would use up all my slots and
never finish, and it was quite painful to discover exactly what the
query was (slow query logging doesn't help if the query never finishes).
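
To make the idea concrete, here is a minimal sketch of a registry of
in-flight queries. InFlightQueries and its methods are hypothetical names
made up for this example; nothing here is an existing Cassandra API, and a
real version would need to hook into the request path and bound its memory
use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical registry of queries that have started but not yet finished. */
final class InFlightQueries {
    private final Map<Long, String> running = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    /** Records a query as running and returns a handle for end(). */
    long begin(String query) {
        long id = nextId.incrementAndGet();
        running.put(id, query);
        return id;
    }

    /** Removes the query once it completes, fails or times out. */
    void end(long id) {
        running.remove(id);
    }

    /** Copy of the queries running right now, including ones that never finish. */
    List<String> snapshot() {
        return new ArrayList<>(running.values());
    }
}
```

A query that never finishes never reaches end(), so it stays visible in
snapshot(), which is exactly the case a log written on completion misses.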


-- 
Corentin Chary
http://xf.iksaif.net


tracing improvements

2017-01-25 Thread Sam Overton
Hello cassandra-dev,

I would like to continue the momentum on improving Cassandra's tracing,
following Mick's excellent work on pluggable tracing and Zipkin support.

There are a couple of areas we can improve that would make tracing an even
more useful tool for cluster operators to diagnose ongoing issues.

The control we currently have over tracing is coarse and somewhat cumbersome.
Enabling tracing from the client for a specific query is fine for application
developers, particularly in an environment where Zipkin is being used to trace
all parts of the system and show an aggregated view. For an operator
investigating an issue, however, this does not always give us the control that
we need in order to obtain relevant data. We often need to diagnose an issue
without the possibility of making any changes in the client, and often without
prior knowledge of which queries at the application level are experiencing
poor performance.

Our only other instigator of tracing is nodetool settraceprobability, which
only affects a single node and gives us no control over precisely which queries
get traced. In practice, it is very difficult to find the relevant queries that
we want to investigate, so we have often resorted to bulk loading the traces
into an external tool for analysis, and this seems sub-optimal when Cassandra
could reduce much of the friction.
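
To illustrate why node-wide sampling gives no control over which queries
are traced, here is a minimal sketch of a purely probabilistic decision.
ProbabilisticSampler is a hypothetical class written for this example and
is not Cassandra's actual implementation; it only assumes that the
configured probability is compared against a random number per request.

```java
import java.util.Random;

/** Hypothetical node-wide sampler; illustrates the idea, not Cassandra's code. */
final class ProbabilisticSampler {
    private final double traceProbability; // the value an operator would configure
    private final Random random;

    ProbabilisticSampler(double traceProbability, Random random) {
        if (traceProbability < 0.0 || traceProbability > 1.0)
            throw new IllegalArgumentException("probability must be in [0, 1]");
        this.traceProbability = traceProbability;
        this.random = random;
    }

    /** The query text is ignored: keyspace, table and client play no part. */
    boolean shouldTrace(String query) {
        return traceProbability > 0.0 && random.nextDouble() < traceProbability;
    }
}
```

Because the decision never looks at the query, the interesting requests are
traced at the same rate as everything else and have to be picked out
afterwards.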

I have a few proposals to improve tracing that I'd like to throw out to
the mailing list to get feedback before I start implementing.

1. Include trace_probability as a CF-level property, so sampled tracing can be
enabled on a per-CF basis, cluster-wide, by changing the CF property.
https://issues.apache.org/jira/browse/CASSANDRA-13154

2. Allow tracing at the CFS level. If we have a misbehaving host, then it would
be useful to enable sampled tracing at the CFS layer on just that host so that
we can investigate queries landing on that replica, rather than just queries
passing through as a coordinator as is currently possible.
https://issues.apache.org/jira/browse/CASSANDRA-13155

3. Add an interface allowing for custom filters which can decide whether
tracing should be enabled for a given query. This is a similar idea to
CASSANDRA-9193 [1] but following the same pattern that we have for
IAuthenticator, IEndpointSnitch, ConfigurationLoader et al., where the
intention is that useful default implementations are provided, but abstracted
in such a way that custom implementations can be written for deployments where
a specific type of functionality is required. This would then allow solutions
such as CASSANDRA-11012 [2] without any specific support needing to be written
in Cassandra.
https://issues.apache.org/jira/browse/CASSANDRA-13156
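
As a rough illustration of proposals 1 and 3, the filter hook might look
something like the sketch below. TraceFilter, QueryContext and both
implementations are hypothetical names invented for this example; none of
them exist in Cassandra, and the real interface would be settled on the
tickets above. The client filter matches a single address rather than an IP
range, purely to keep the sketch short.

```java
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical description of a query as seen by a filter; field names are assumptions. */
final class QueryContext {
    final String keyspace;
    final String table;
    final InetAddress client;

    QueryContext(String keyspace, String table, InetAddress client) {
        this.keyspace = keyspace;
        this.table = table;
        this.client = client;
    }
}

/** Hypothetical pluggable hook: decides whether tracing is enabled for a query. */
interface TraceFilter {
    boolean shouldTrace(QueryContext ctx);
}

/** Possible default: sample by a per-table probability (proposal 1 as a filter). */
final class PerTableProbabilityFilter implements TraceFilter {
    private final Map<String, Double> probabilityByTable; // keyed by "keyspace.table"

    PerTableProbabilityFilter(Map<String, Double> probabilityByTable) {
        this.probabilityByTable = probabilityByTable;
    }

    @Override
    public boolean shouldTrace(QueryContext ctx) {
        // Tables without an entry are never traced.
        double p = probabilityByTable.getOrDefault(ctx.keyspace + "." + ctx.table, 0.0);
        return p > 0.0 && ThreadLocalRandom.current().nextDouble() < p;
    }
}

/** Possible custom filter: trace one client only, in the spirit of CASSANDRA-11012. */
final class ClientAddressFilter implements TraceFilter {
    private final InetAddress tracedClient;

    ClientAddressFilter(InetAddress tracedClient) {
        this.tracedClient = tracedClient;
    }

    @Override
    public boolean shouldTrace(QueryContext ctx) {
        return tracedClient.equals(ctx.client);
    }
}
```

With a hook of this shape, per-table sampling and per-client tracing become
two implementations of the same interface instead of two separate features.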

Thanks for reading!
Regards,

Sam


[1] https://issues.apache.org/jira/browse/CASSANDRA-9193 Facility to write
dynamic code to selectively trigger trace or log for queries

[2] https://issues.apache.org/jira/browse/CASSANDRA-11012 Allow tracing CQL of
a specific client only, based on IP (range)