Just to pick this up: Did you see any system load spikes? I'm tracing a
problem on 2.2.7 where my cluster sees load spikes up to 20-30, while the
normal load average is around 3-4. So far I haven't found a good explanation,
but I'm going to try otc_coalescing_strategy: DISABLED tomorrow.
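
For reference, this is the cassandra.yaml change I'm planning to try (a
minimal sketch; the window option is from memory of the 2.2 defaults and only
matters while a coalescing strategy is active):

    # cassandra.yaml
    otc_coalescing_strategy: DISABLED
    # otc_coalescing_window_us: 200   # default; ignored once coalescing is disabled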

 - Garo

On Fri, Jul 15, 2016 at 6:16 PM, Mike Heffner <m...@librato.com> wrote:

> Just to follow up on this post with a couple more data points:
>
> 1)
>
> We upgraded to 2.2.7 and did not see any change in behavior.
>
> 2)
>
> However, what *has* fixed this issue for us is disabling message coalescing
> by setting:
>
> otc_coalescing_strategy: DISABLED
>
> We were using the default setting before (TIMEHORIZON, I believe).
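>
> For anyone who wants to try the same change, this is roughly what it looks
> like in cassandra.yaml (the default value below is my best recollection of
> the 2.2 settings, so treat it as a sketch):
>
> # default in 2.2, what we were running before:
> # otc_coalescing_strategy: TIMEHORIZON
>
> # what we set to disable message coalescing:
> otc_coalescing_strategy: DISABLED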
>
> We see periodic timeouts on the ring (once every few hours), but they are
> brief and don't impact latency. With message coalescing turned on we would
> see these timeouts persist consistently after an initial spike. My guess is
> that something in the coalescing logic is disturbed by the initial timeout
> spike, which then leads to dropping all, or a high percentage of, subsequent
> traffic.
>
> We are planning to continue production use with message coalescing disabled
> for now, and may run tests in our staging environments to identify where
> the coalescing breaks down.
>
> Mike
>
> On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner <m...@librato.com> wrote:
>
>> Jeff,
>>
>> Thanks, yeah we updated to the 2.16.4 driver version from source. I don't
>> believe we've hit the bugs mentioned in earlier driver versions.
>>
>> Mike
>>
>> On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>> wrote:
>>
>>> The AWS Ubuntu 14.04 AMI ships with a buggy enhanced networking driver;
>>> depending on your instance type / hypervisor choice, you may want to
>>> ensure you're not hitting that bug.
>>>
>>>
>>>
>>> *From: *Mike Heffner <m...@librato.com>
>>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Date: *Friday, July 1, 2016 at 1:10 PM
>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Cc: *Peter Norton <p...@librato.com>
>>> *Subject: *Re: Ring connection timeouts with 2.2.6
>>>
>>>
>>>
>>> Jens,
>>>
>>>
>>>
>>> We haven't noticed any particularly large GC operations or even
>>> persistently high GC times.
>>>
>>>
>>>
>>> Mike
>>>
>>>
>>>
>>> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.ran...@tink.se>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Could it be garbage collection occurring on nodes that are more heavily
>>> loaded?
>>>
>>> Cheers,
>>> Jens
>>>
>>>
>>>
>>> Den sön 26 juni 2016 05:22Mike Heffner <m...@librato.com> skrev:
>>>
>>> One thing to add: if we do a rolling restart of the ring, the timeouts
>>> disappear entirely for several hours and performance returns to normal.
>>> It's as if something is leaking over time, but we haven't seen any
>>> noticeable change in heap usage.
>>>
>>>
>>>
>>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> We have a 12-node 2.2.6 ring running in AWS, single DC with RF=3, that
>>> is sitting at <25% CPU, doing mostly writes, and not showing any
>>> particularly long GC times/pauses. By all observed metrics the ring is
>>> healthy and performing well.
>>>
>>>
>>>
>>> However, we are noticing a fairly consistent number of connection
>>> timeouts coming from the messaging service between various pairs of nodes
>>> in the ring. The "Connection.TotalTimeouts" meter metric shows hundreds of
>>> thousands of timeouts per minute, usually between two pairs of nodes. The
>>> timeouts tend to persist for several hours at a time, then may stop or
>>> move to other pairs of nodes in the ring. The metric
>>> "Connection.SmallMessageDroppedTasks.<ip>" will also grow for one of the
>>> node pairs showing up in the TotalTimeouts metric.
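>>>
>>> In case it helps anyone else watching these counters, this is roughly how
>>> we poll that meter over JMX (a sketch only: the MBean name
>>> "org.apache.cassandra.metrics:type=Connection,name=TotalTimeouts", the
>>> attribute names, and port 7199 are what we believe 2.2 exposes by default):
>>>
>>> import javax.management.MBeanServerConnection;
>>> import javax.management.ObjectName;
>>> import javax.management.remote.JMXConnector;
>>> import javax.management.remote.JMXConnectorFactory;
>>> import javax.management.remote.JMXServiceURL;
>>>
>>> public class TotalTimeoutsProbe {
>>>     public static void main(String[] args) throws Exception {
>>>         String host = args.length > 0 ? args[0] : "localhost";
>>>         // Default Cassandra JMX port; adjust if yours differs.
>>>         JMXServiceURL url = new JMXServiceURL(
>>>             "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
>>>         JMXConnector jmxc = JMXConnectorFactory.connect(url);
>>>         try {
>>>             MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
>>>             // Aggregate outbound-connection timeout meter (assumed name).
>>>             ObjectName meter = new ObjectName(
>>>                 "org.apache.cassandra.metrics:type=Connection,name=TotalTimeouts");
>>>             Object count = mbsc.getAttribute(meter, "Count");
>>>             Object rate = mbsc.getAttribute(meter, "OneMinuteRate");
>>>             System.out.println("TotalTimeouts count=" + count + " 1m rate=" + rate + "/s");
>>>         } finally {
>>>             jmxc.close();
>>>         }
>>>     }
>>> }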
>>>
>>>
>>>
>>> Looking at the debug log typically shows a large number of messages like
>>> the following on one of the nodes:
>>>
>>>
>>>
>>> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
>>>
>>> We have cross-node timeouts enabled, but ntp is running on all nodes and
>>> no node appears to have clock drift.
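>>>
>>> (For context, that is the following cassandra.yaml flag; as far as I
>>> understand it, it only causes trouble when clocks drift between nodes,
>>> which is why we double-checked ntp:)
>>>
>>> # use the sender's timestamp to decide whether a cross-node message has timed out
>>> cross_node_timeout: true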
>>>
>>>
>>>
>>> The network appears to be fine between nodes, with iperf tests showing
>>> that we have a lot of headroom.
>>>
>>>
>>>
>>> Any thoughts on what to look for? Can we increase thread count/pool
>>> sizes for the messaging service?
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Mike
>>>
>>>
>>>
>>> --
>>>
>>>
>>>   Mike Heffner <m...@librato.com>
>>>
>>>   Librato, Inc.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>>   Mike Heffner <m...@librato.com>
>>>
>>>   Librato, Inc.
>>>
>>>
>>>
>>> --
>>>
>>> Jens Rantil
>>> Backend Developer @ Tink
>>>
>>> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
>>> For urgent matters you can reach me at +46-708-84 18 32.
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>>   Mike Heffner <m...@librato.com>
>>>
>>>   Librato, Inc.
>>>
>>>
>>>
>>
>>
>>
>> --
>>
>>   Mike Heffner <m...@librato.com>
>>   Librato, Inc.
>>
>>
>
>
> --
>
>   Mike Heffner <m...@librato.com>
>   Librato, Inc.
>
>
