[ https://issues.apache.org/jira/browse/CASSANDRA-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740075#comment-14740075 ]
Stefania commented on CASSANDRA-7392:
-------------------------------------
Thanks for the detailed explanation of the gains of lazySet vs CAS. Since we
don't care about a slight imprecision in logging, let's keep lazySet.
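For readers without the patch at hand, the trade-off can be sketched roughly as below. This is only an illustration under assumptions: the class {{LazySetVsCas}}, its field and its method names are hypothetical and are not taken from the patch. It contrasts a CAS update, where exactly one racing thread wins, with a {{lazySet}} update, which is a cheaper store but may let two racing threads both decide to log.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; this is not the code from the patch.
final class LazySetVsCas
{
    private final AtomicLong lastLoggedNanos = new AtomicLong(0);

    // CAS variant: among threads racing in the same interval, exactly one succeeds.
    boolean shouldLogCas(long nowNanos, long intervalNanos)
    {
        long last = lastLoggedNanos.get();
        return nowNanos - last >= intervalNanos
               && lastLoggedNanos.compareAndSet(last, nowNanos);
    }

    // lazySet variant: no atomic read-modify-write, so the update is cheaper,
    // but two racing threads may both pass the check and both return true.
    boolean shouldLogLazy(long nowNanos, long intervalNanos)
    {
        if (nowNanos - lastLoggedNanos.get() < intervalNanos)
            return false;
        lastLoggedNanos.lazySet(nowNanos);
        return true;
    }
}
```

With {{lazySet}} the worst case in this sketch is an occasional duplicate log line, which matches the "slight imprecision" accepted above.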
bq. I see options in the YAML as a bad thing and I also don't like undocumented
options. I also fear shipping with hard coded constants because it means there
is no option other than recompiling in the field.
bq. I see properties as bridging the gap between the next release where we
either fix the implementation so tuning the defaults isn't necessary or make it
an option in the YAML if we can't make it just work without operator
intervention.
I think I misunderstood what you meant by property. I have moved them from
{{Config}} to static values in {{MonitoringTask}} that are set via
{{System.getProperty()}}.
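As a rough sketch of what "static values set via {{System.getProperty()}}" can look like, assuming hypothetical property names and defaults (the class {{PropertyDefaults}}, the {{example.*}} names and the values 5000 and 50 are invented for illustration and are not the ones in the patch):

```java
// Hypothetical illustration only; names and defaults are not from the patch.
final class PropertyDefaults
{
    // Read an integer system property, falling back to a default when it is unset.
    static int intProperty(String name, int defaultValue)
    {
        String value = System.getProperty(name);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // Resolved once, when the class is initialised; an operator can override
    // them with -Dexample.monitoring_report_interval_ms=... on the command line.
    static final int REPORT_INTERVAL_MS = intProperty("example.monitoring_report_interval_ms", 5000);
    static final int MAX_OPERATIONS = intProperty("example.monitoring_max_operations", 50);
}
```

Because such fields are resolved at class initialisation, setting the property afterwards has no effect on them, so tests have to set it before the class is first used.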
bq. WRT to the default. I would say 1% of the timeout is pretty high precision.
You do have some insight there thinking about the frequency as a % of the
timeout. I would say go with that and set the check frequency as a % of the
timeout unless overridden by a property, but also set a minimum value. I think
50 milliseconds is good as a minimum. I would say 10% is good enough so we
would be off by around 1/10th of the timeout, but I don't feel strongly.
Sounds good, done. Enforcing a minimum of 50 milliseconds, however, slows down
the unit tests a bit, because overriding the minimum as well gets messy: the
singleton is submitted for scheduling before we can change any class field. I
could move the properties to another class to make it a bit cleaner.
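The rule discussed above (a percentage of the timeout, a 50 millisecond floor, and a property override) could be sketched as follows. The assumptions are loud ones: the 10% figure is the one suggested in the quote, the property name is hypothetical, and letting the override bypass the minimum is a choice made for this sketch, not something the thread confirms.

```java
// Hypothetical illustration only; not the code from the patch.
final class CheckInterval
{
    static final long MIN_CHECK_INTERVAL_MS = 50;   // minimum suggested in the thread
    static final long TIMEOUT_DIVISOR = 10;         // 10% of the timeout (assumed)
    static final String OVERRIDE_PROPERTY = "example.monitoring_check_interval_ms"; // invented name

    static long checkIntervalMs(long timeoutMs)
    {
        // Assumption: an explicit override wins and is not clamped to the minimum.
        String override = System.getProperty(OVERRIDE_PROPERTY);
        if (override != null)
            return Long.parseLong(override.trim());

        // Otherwise use a fraction of the timeout, but never less than the minimum.
        return Math.max(MIN_CHECK_INTERVAL_MS, timeoutMs / TIMEOUT_DIVISOR);
    }
}
```

Under these assumptions a 5000 ms timeout would be checked every 500 ms and a 200 ms timeout every 50 ms.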
bq. That log message is fixed size (% of verbs) and covers all dropped messages
and not just the subset of timeouts you are working on right now. I would say
leave it alone just by virtue of being out of scope.
Ok.
--
I rebased and submitted these new changes. I've also added min/max/avg
logging for when there is more than one timeout, rather than just displaying
the details of the first and last timeout.
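A minimal sketch of this kind of aggregation, assuming the elapsed times are available as an array of milliseconds. The class {{TimeoutSummary}} and the message wording are invented for illustration and do not reproduce the log format in the patch.

```java
import java.util.LongSummaryStatistics;
import java.util.stream.LongStream;

// Hypothetical illustration only; the message format is invented.
final class TimeoutSummary
{
    static String summarize(long[] elapsedMs)
    {
        LongSummaryStatistics stats = LongStream.of(elapsedMs).summaryStatistics();
        if (stats.getCount() == 0)
            return "no operations timed out";
        if (stats.getCount() == 1)
            return "1 operation timed out after " + stats.getMin() + " ms";

        // More than one timeout: report the spread instead of individual entries.
        return stats.getCount() + " operations timed out: min " + stats.getMin()
               + " ms, max " + stats.getMax()
               + " ms, avg " + Math.round(stats.getAverage()) + " ms";
    }
}
```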
--
bq. I think that's it. NoSpamLogger doesn't allow you to specify a policy for
backoff as opposed to fixed intervals. I think that is a missing capability.
Have it support a policy, and then tell you whether it is time to log so you
can decide whether to clear the stats.
bq. One caveat that occurs to me of logging this kind of thing at a variable
rate is that absolute counts of events are no longer informative. You need to
log a rate so you can compare without having to do your own math. Even then
there is some harm because you could grow the reporting interval to include
time where nothing is going wrong distorting the reported rate. I think there
is some tension between my pony and providing precise data. What do you think?
How would we calculate the rate without also storing the totals? I'm not sure
variable rate logging is the best way to go about it given that we are trying
to achieve a _poor man's "slow query log" for free_. The issue is how to avoid
polluting log files, so the effort required to support variable rate logging
would perhaps be better spent logging the timed-out queries elsewhere?
[~jbellis] and [~iamaleksey] WDYT?
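To make the question about totals concrete, here is a rough sketch of variable rate logging with a doubling backoff. Everything in it is hypothetical: {{BackoffRateLogger}} is not an existing class, {{NoSpamLogger}} has no such API according to the comments above, and the doubling policy is just one possible choice. It shows that reporting a rate requires keeping a per-window count and the window start time, and that the window grows while events keep arriving.

```java
import java.util.OptionalDouble;

// Hypothetical illustration only; not thread-safe and not part of NoSpamLogger.
final class BackoffRateLogger
{
    private final long initialIntervalNanos;
    private final long maxIntervalNanos;
    private long currentIntervalNanos;
    private long windowStartNanos;
    private long eventsInWindow; // the total that must be stored to compute a rate

    BackoffRateLogger(long initialIntervalNanos, long maxIntervalNanos, long nowNanos)
    {
        this.initialIntervalNanos = initialIntervalNanos;
        this.maxIntervalNanos = maxIntervalNanos;
        this.currentIntervalNanos = initialIntervalNanos;
        this.windowStartNanos = nowNanos;
    }

    void record()
    {
        eventsInWindow++;
    }

    // Returns events per second for the elapsed window when it is time to log,
    // or an empty value when the caller should stay quiet and keep its stats.
    OptionalDouble pollRate(long nowNanos)
    {
        long elapsed = nowNanos - windowStartNanos;
        if (elapsed < currentIntervalNanos)
            return OptionalDouble.empty();

        double ratePerSecond = eventsInWindow * 1e9 / elapsed;

        // Back off while events keep arriving; return to the initial interval once quiet.
        currentIntervalNanos = eventsInWindow > 0
                             ? Math.min(currentIntervalNanos * 2, maxIntervalNanos)
                             : initialIntervalNanos;
        windowStartNanos = nowNanos;
        eventsInWindow = 0;
        return OptionalDouble.of(ratePerSecond);
    }
}
```

The sketch also exhibits the distortion mentioned in the quote: once the interval has grown, a burst followed by a quiet period is averaged over the whole window.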
> Abort in-progress queries that time out
> ---------------------------------------
>
> Key: CASSANDRA-7392
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7392
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Stefania
> Priority: Critical
> Fix For: 3.x
>
>
> Currently we drop queries that time out before we get to them (because node
> is overloaded) but not queries that time out while being processed.
> (Particularly common for index queries on data that shouldn't be indexed.)
> Adding the latter and logging when we have to interrupt one gets us a poor
> man's "slow query log" for free.