[ 
https://issues.apache.org/jira/browse/CASSANDRA-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731797#comment-14731797
 ] 

Anuj Wadehra edited comment on CASSANDRA-8907 at 9/5/15 1:31 PM:
-----------------------------------------------------------------

[~johnny15676] Got your point! I think there are two scenarios: 

1. Suppose we enable the GC pause warn limit by default and set it to some 
value, say 200 ms. Our guess may be wrong, as an application could be 
comfortable with 200 ms GC pauses. If someone then upgrades to a minor 
version, we would 'break' their log monitoring/warning system (as you said), 
because the new warnings for GC pauses greater than 200 ms are undesirable 
for them.

2. Suppose a user's nodes are going down intermittently due to long GC pauses 
(20+ secs), but their log monitoring system sees no warnings and reports no 
issue. This is a BUG. If such users upgrade with the GC pause warn limit 
enabled by default and set to a much higher value, say 20000 ms, they will 
start getting warnings whenever they hit ad hoc huge GC pauses over 20 sec. 
I won't call that 'breaking' their log monitoring system, as it is a serious 
issue: otherwise their nodes will keep going DOWN intermittently without 
raising any warnings.

So I advocate targeting scenario 2, where this property is enabled by default 
and set to a very high value (20000+ ms). This way we are not trying to guess 
which GC pauses are OK for a user's application, i.e. whether 100 ms, 200 ms 
or 1000 ms is comfortable. At the same time, we raise a warning when there is 
a serious GC pause that may cause a node to be marked down. A user must get 
warnings if such huge GC pauses are happening.

Any user upgrading to a minor version will have the option to decrease the 
value based on their application's requirements or leave it as it is. 

Moreover, if you agree with the above, I would suggest that tpstats be logged 
for pauses over min(1000 ms, GC warn threshold), where the threshold defaults 
to e.g. 20000 ms. If a user is sensitive to GC pauses greater than 100 ms, 
they will reduce the GC warn threshold to 100 ms and then see diagnostic 
tpstats info every time a GC pause over 100 ms occurs. If the user doesn't 
change the HUGE default GC warn limit (20000+ ms), we stick to the existing 
behaviour, i.e. dump tpstats for GC pauses of more than 1000 ms (currently 
hard-coded), so the existing way of dumping tpstats doesn't break.
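
To make the proposal concrete, here is a minimal sketch of the two checks. 
The class, method and constant names are hypothetical and are not taken from 
GCInspector or from the attached patch; only the values 1000 ms and 20000 ms 
and the min() rule come from the comment above.

```java
public final class GcPausePolicy {
    // Existing hard-coded tpstats threshold mentioned in the comment.
    static final long EXISTING_TPSTATS_THRESHOLD_MS = 1000;
    // Proposed "very high" default for the warn limit (assumption: exactly 20000).
    static final long PROPOSED_DEFAULT_WARN_THRESHOLD_MS = 20000;

    // tpstats threshold = min(1000 ms, GC warn threshold).
    static long tpstatsThresholdMs(long gcWarnThresholdMs) {
        return Math.min(EXISTING_TPSTATS_THRESHOLD_MS, gcWarnThresholdMs);
    }

    // A pause is logged at WARN only when it exceeds the configured limit.
    static boolean shouldWarn(long pauseMs, long gcWarnThresholdMs) {
        return pauseMs > gcWarnThresholdMs;
    }

    // tpstats is dumped when the pause exceeds the derived threshold.
    static boolean shouldLogTpstats(long pauseMs, long gcWarnThresholdMs) {
        return pauseMs > tpstatsThresholdMs(gcWarnThresholdMs);
    }

    public static void main(String[] args) {
        long[] thresholds = {PROPOSED_DEFAULT_WARN_THRESHOLD_MS, 100};
        long[] pauses = {150, 1835, 9866, 25000};
        for (long threshold : thresholds) {
            for (long pause : pauses) {
                System.out.printf("warnLimit=%d pause=%d warn=%b tpstats=%b%n",
                        threshold, pause,
                        shouldWarn(pause, threshold),
                        shouldLogTpstats(pause, threshold));
            }
        }
    }
}
```

With the default left at 20000 ms, tpstats behaves as it does today (dumped 
above 1000 ms) and WARN only fires for pauses over 20 sec. Lowering the limit 
to 100 ms moves both checks down to 100 ms.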

Small concern so we can quickly discuss and close it :)



> Raise GCInspector alerts to WARN
> --------------------------------
>
>                 Key: CASSANDRA-8907
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8907
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Adam Hattrell
>            Assignee: Amit Singh Chowdhery
>              Labels: patch
>         Attachments: cassnadra-8907.patch
>
>
> I'm fairly regularly running into folks wondering why their applications are 
> reporting down nodes. Yet, they report, when they grep the logs there are 
> no WARNs or ERRORs listed.
> Nine times out of ten, when I look through the logs I see a ton of ParNew or 
> CMS GC pauses occurring, similar to the following:
> INFO [ScheduledTasks:1] 2013-03-07 18:44:46,795 GCInspector.java (line 122) 
> GC for ConcurrentMarkSweep: 1835 ms for 3 collections, 2606015656 used; max 
> is 10611589120
> INFO [ScheduledTasks:1] 2013-03-07 19:45:08,029 GCInspector.java (line 122) 
> GC for ParNew: 9866 ms for 8 collections, 2910124308 used; max is 6358564864
> To my mind these should be WARNs, as they have the potential to 
> significantly impact the cluster's performance as a whole.
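
Until such pauses are raised to WARN, they can still be found by scanning the 
INFO lines. The following is a rough sketch that assumes the line format 
quoted above ("GC for <collector>: <N> ms for <M> collections"); the class and 
method names are hypothetical. Note that the reported figure is the total 
across the listed collections, not necessarily a single pause.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class GcLogScan {
    // Matches the GCInspector INFO format shown in the issue description.
    private static final Pattern GC_LINE =
            Pattern.compile("GC for (\\w+): (\\d+) ms for (\\d+) collections");

    // Returns the reported GC time in ms, or -1 if the line is not a GC line.
    static long reportedGcMs(String logLine) {
        Matcher m = GC_LINE.matcher(logLine);
        return m.find() ? Long.parseLong(m.group(2)) : -1;
    }

    // True when the line reports more GC time than the given limit.
    static boolean exceeds(String logLine, long limitMs) {
        return reportedGcMs(logLine) > limitMs;
    }

    public static void main(String[] args) {
        String line = "INFO [ScheduledTasks:1] 2013-03-07 19:45:08,029 "
                + "GCInspector.java (line 122) GC for ParNew: 9866 ms for 8 "
                + "collections, 2910124308 used; max is 6358564864";
        System.out.println(reportedGcMs(line) + " ms, over 1000 ms: "
                + exceeds(line, 1000));
    }
}
```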



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
