[ 
https://issues.apache.org/jira/browse/CASSANDRA-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740075#comment-14740075
 ] 

Stefania commented on CASSANDRA-7392:
-------------------------------------

Thanks for the detailed explanation of the gains of lazySet vs CAS. Since we 
don't care about a slight imprecision in logging, let's keep lazySet.
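
For readers unfamiliar with the trade-off, here is a minimal sketch of the two write styles on a {{java.util.concurrent.atomic.AtomicLong}}. The class and method names are hypothetical and this is not the code in the patch; it only illustrates why a slightly stale value is acceptable when the field feeds logging.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder for an approximate timestamp; not the class used in the patch.
final class ApproximateTimestamp
{
    private final AtomicLong nanos = new AtomicLong(0L);

    // lazySet: an ordered store that avoids the full fence of a volatile write.
    // Other threads may briefly keep reading the previous value.
    void recordLazy(long nowNanos)
    {
        nanos.lazySet(nowNanos);
    }

    // CAS: succeeds only if the current value still equals 'expected',
    // so at most one of several racing threads wins for a given expected value.
    boolean recordCas(long expected, long nowNanos)
    {
        return nanos.compareAndSet(expected, nowNanos);
    }

    long get()
    {
        return nanos.get();
    }
}
```

The lazy write gives up immediate visibility to other threads in exchange for a cheaper store, which is a reasonable trade when the reader only uses the value for log output.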

bq. I see options in the YAML as a bad thing and I also don't like undocumented 
options. I also fear shipping with hard coded constants because it means there 
is no option other than recompiling in the field.

bq. I see properties as bridging the gap between the next release where we 
either fix the implementation so tuning the defaults isn't necessary or make it 
an option in the YAML if we can't make it just work without operator 
intervention.

I think I misunderstood what you meant by property. I have moved the options 
from {{Config}} to static values in {{MonitoringTask}} that are set via 
{{System.getProperty()}}.
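
As a rough illustration of that pattern, a static value can be read from a system property with a fallback default. The helper class and the property name below are made up for the example; they are not the names used in {{MonitoringTask}}.

```java
// Hypothetical helper; the real property names in the patch are not shown here.
final class MonitoringProperties
{
    // Assumed, illustrative property name (e.g. passed as -Dexample.monitoring.report_interval_ms=1000).
    static final String REPORT_INTERVAL_PROPERTY = "example.monitoring.report_interval_ms";

    // Reads an integer system property, falling back to the default when the
    // property is missing or is not a valid integer.
    static int intProperty(String name, int defaultValue)
    {
        String raw = System.getProperty(name);
        if (raw == null)
            return defaultValue;
        try
        {
            return Integer.parseInt(raw.trim());
        }
        catch (NumberFormatException e)
        {
            return defaultValue;
        }
    }
}
```

A static field initialised from such a helper is evaluated once, when the owning class is initialised, so the property has to be set before that class is first used.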

bq. WRT to the default. I would say 1% of the timeout is pretty high precision. 
You do have some insight there thinking about the frequency as a % of the 
timeout. I would say go with that and set the check frequency as a % of the 
timeout unless overridden by a property, but also set a minimum value. I think 
50 milliseconds is good as a minimum. I would say 10% is good enough so we 
would be off by around 1/10th of the timeout, but I don't feel strongly.

Sounds good, done. However, enforcing a minimum of 50 milliseconds slows down 
the unit tests a bit, because overriding the minimum as well gets messy: the 
singleton is submitted for scheduling before we can change any class field. I 
could move the properties to another class to make it a bit cleaner.
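
The rule discussed above (a percentage of the timeout, with a floor) can be sketched as follows. The class and method names are hypothetical. The 10% and 50 ms figures come from the quoted suggestion; the sketch does not claim to match the patch exactly.

```java
// Hypothetical illustration of "check interval = % of timeout, with a minimum".
final class CheckInterval
{
    static final long MIN_INTERVAL_MILLIS = 50; // minimum suggested in the quoted comment
    static final int DEFAULT_PERCENT = 10;      // percentage suggested in the quoted comment

    // Returns the check interval for a given timeout, never below the minimum.
    static long compute(long timeoutMillis, int percent)
    {
        long proportional = timeoutMillis * percent / 100;
        return Math.max(MIN_INTERVAL_MILLIS, proportional);
    }
}
```

With a 5000 ms timeout this gives a 500 ms check interval. With a 200 ms timeout the proportional value would be 20 ms, so the 50 ms floor applies, which is why tests using short timeouts wait longer than they otherwise would.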

bq. That log message is fixed size (% of verbs) and covers all dropped messages 
and not just the subset of timeouts you are working on right now. I would say 
leave it alone just by virtue of being out of scope.

Ok.

--

I rebased and submitted these new changes. I have also added min/max/avg 
logging for when there is more than one timeout, rather than just displaying 
the details of the first and last timeout.
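
A minimal sketch of what such a summary line could look like, assuming the elapsed times of the timed-out operations are available in milliseconds. The class name and message wording are invented for the example and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.LongSummaryStatistics;

// Hypothetical formatter for an aggregated timeout report.
final class TimeoutSummary
{
    // Builds one log line with count, min, max and average of the elapsed times.
    static String summarize(long[] elapsedMillis)
    {
        if (elapsedMillis.length == 0)
            return "no operations timed out";

        LongSummaryStatistics stats = Arrays.stream(elapsedMillis).summaryStatistics();
        return String.format(Locale.ROOT,
                             "%d operations timed out (min %d ms, max %d ms, avg %.1f ms)",
                             stats.getCount(), stats.getMin(), stats.getMax(), stats.getAverage());
    }
}
```

For elapsed times of 100, 200 and 600 ms, the result is {{3 operations timed out (min 100 ms, max 600 ms, avg 300.0 ms)}}.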

--

bq. I think that's it. NoSpamLogger doesn't allow you to specify a policy for 
backoff as opposed to fixed intervals. I think that is a missing capability. 
Have it support a policy, and then tell you whether it is time to log so you 
can decide whether to clear the stats.

bq. One caveat that occurs to me of logging this kind of thing at a variable 
rate is that absolute counts of events are no longer informative. You need to 
log a rate so you can compare without having to do your own math. Even then 
there is some harm because you could grow the reporting interval to include 
time where nothing is going wrong distorting the reported rate. I think there 
is some tension between my pony and providing precise data. What do you think?

How would we calculate the rate without also storing the totals? I'm not sure 
variable-rate logging is the best way to go about it, given that we are trying 
to achieve a _poor man's "slow query log" for free_. The issue is how to avoid 
polluting log files, so perhaps the effort required to support variable-rate 
logging would be better spent logging the timed-out queries elsewhere?
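
To make the point about totals concrete, here is a small sketch of a per-window rate calculation. It is a hypothetical, single-threaded illustration, not a proposal for {{NoSpamLogger}}: the rate can only be derived from a stored count and the length of the reporting window.

```java
// Hypothetical, non-thread-safe illustration: a rate needs a stored count.
final class WindowedRate
{
    private long count;
    private long windowStartNanos;

    WindowedRate(long nowNanos)
    {
        this.windowStartNanos = nowNanos;
    }

    void record()
    {
        count++;
    }

    // Returns events per second since the window started, then starts a new window.
    double reportAndReset(long nowNanos)
    {
        long elapsedNanos = nowNanos - windowStartNanos;
        double ratePerSecond = elapsedNanos <= 0 ? 0.0 : count * 1e9 / elapsedNanos;
        count = 0;
        windowStartNanos = nowNanos;
        return ratePerSecond;
    }
}
```

If the window grows with a backoff policy, quiet periods are averaged in with busy ones, which is the distortion mentioned in the quoted caveat.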

[~jbellis] and [~iamaleksey] WDYT?

> Abort in-progress queries that time out
> ---------------------------------------
>
>                 Key: CASSANDRA-7392
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7392
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Critical
>             Fix For: 3.x
>
>
> Currently we drop queries that time out before we get to them (because node 
> is overloaded) but not queries that time out while being processed.  
> (Particularly common for index queries on data that shouldn't be indexed.)  
> Adding the latter and logging when we have to interrupt one gets us a poor 
> man's "slow query log" for free.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
