Hi Dikang,

Do you have any GC logging or metrics you can correlate with the dropped 
messages? A 13 second pause sounds like a bad GC pause.
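
For reference, the "13 second" figure comes from the gossip warning quoted below: the two numbers in that line are nanoseconds (the measured local pause and the threshold it exceeded). A minimal sketch of the conversion, assuming the log line keeps the shape shown in the quoted report; the regex and the helper name `parse_local_pause` are illustrative, not part of Cassandra:

```python
import re

# Log line copied from the quoted report below.
LINE = ("2017-01-21_16:43:56.71033 WARN 16:43:56 [GossipTasks:1]: Not marking "
        "nodes down due to local pause of 13079498815 > 5000000000")

# Hypothetical pattern: captures the pause and the threshold, both in nanoseconds.
PAUSE_RE = re.compile(r"local pause of (\d+) > (\d+)")

def parse_local_pause(line):
    """Return (pause_seconds, threshold_seconds), or None if the line does not match."""
    m = PAUSE_RE.search(line)
    if m is None:
        return None
    pause_ns, threshold_ns = (int(g) for g in m.groups())
    return pause_ns / 1e9, threshold_ns / 1e9

pause_s, threshold_s = parse_local_pause(LINE)
print(f"pause {pause_s:.2f}s, threshold {threshold_s:.2f}s")
# prints: pause 13.08s, threshold 5.00s
```

Running the same extraction over the full log would give timestamps to line up against GC pause entries, if GC logging is enabled on those nodes.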

Thanks,

Blake


On January 22, 2017 at 10:37:22 PM, Dikang Gu (dikan...@gmail.com) wrote:

Btw, the C* version is 2.2.5, with several backported patches. 

On Sun, Jan 22, 2017 at 10:36 PM, Dikang Gu <dikan...@gmail.com> wrote: 

> Hello there, 
> 
> We have a cluster of roughly 100 nodes, and I see dropped messages on 
> random nodes in the cluster, which cause error spikes and P99 latency 
> spikes as well. 
> 
> I tried to figure out the cause. I do not see any obvious bottleneck in 
> the cluster; the C* nodes still have plenty of idle CPU and disk I/O 
> capacity. But I do see some suspicious gossip events around that time, 
> and I am not sure whether they are related. 
> 
> 2017-01-21_16:43:56.71033 WARN 16:43:56 [GossipTasks:1]: Not marking nodes down due to local pause of 13079498815 > 5000000000 
> 2017-01-21_16:43:56.85532 INFO 16:43:56 [ScheduledTasks:1]: MUTATION messages were dropped in last 5000 ms: 65 for internal timeout and 10895 for cross node timeout 
> 2017-01-21_16:43:56.85533 INFO 16:43:56 [ScheduledTasks:1]: READ messages were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross node timeout 
> 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: Pool Name Active Pending Completed Blocked All Time Blocked 
> 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: MutationStage 128 47794 1015525068 0 0 
> 2017-01-21_16:43:56.85535 
> 2017-01-21_16:43:56.85535 INFO 16:43:56 [ScheduledTasks:1]: ReadStage 64 20202 450508940 0 0 
> 
> Any suggestions? 
> 
> Thanks! 
> 
> -- 
> Dikang 
> 
> 


-- 
Dikang 
