Just to pick this up: did you see any system load spikes? I'm tracing a problem on 2.2.7 where my cluster sees load spikes up to 20-30, while the normal load average is around 3-4. So far I haven't found a good explanation, but I'm going to try otc_coalescing_strategy: DISABLED tomorrow.
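For reference, what I plan to change is just the cassandra.yaml line from Mike's mail below, rolled out node by node since the setting is only read at startup. A minimal sketch, assuming a package install (paths and service name may differ on yours):

In cassandra.yaml (location assumed, e.g. /etc/cassandra/cassandra.yaml):

    # previously left at the 2.2 default (time horizon, per Mike's mail below)
    otc_coalescing_strategy: DISABLED

Then, one node at a time:

    nodetool drain
    sudo service cassandra restart     # service name assumed for a package install
    nodetool tpstats                   # sanity-check dropped-message counts once the node is back up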
- Garo

On Fri, Jul 15, 2016 at 6:16 PM, Mike Heffner <m...@librato.com> wrote:

> Just to follow up on this post with a couple more data points:
>
> 1) We upgraded to 2.2.7 and did not see any change in behavior.
>
> 2) However, what *has* fixed this issue for us was disabling msg coalescing
> by setting:
>
>     otc_coalescing_strategy: DISABLED
>
> We were using the default setting before (time horizon, I believe).
>
> We see periodic timeouts on the ring (once every few hours), but they are
> brief and don't impact latency. With msg coalescing turned on we would see
> these timeouts persist consistently after an initial spike. My guess is
> that something in the coalescing logic is disturbed by the initial timeout
> spike, which leads to dropping all, or a high percentage of, subsequent
> traffic.
>
> We are planning to continue production use with msg coalescing disabled
> for now and may run tests in our staging environments to identify where
> the coalescing is breaking this.
>
> Mike
>
> On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner <m...@librato.com> wrote:
>
>> Jeff,
>>
>> Thanks, yeah, we updated to the 2.16.4 driver version from source. I
>> don't believe we've hit the bugs mentioned in earlier driver versions.
>>
>> Mike
>>
>> On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
>>
>>> The AWS Ubuntu 14.04 AMI ships with a buggy enhanced networking driver;
>>> depending on your instance types / hypervisor choice, you may want to
>>> ensure you're not seeing that bug.
>>>
>>> From: Mike Heffner <m...@librato.com>
>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Date: Friday, July 1, 2016 at 1:10 PM
>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Cc: Peter Norton <p...@librato.com>
>>> Subject: Re: Ring connection timeouts with 2.2.6
>>>
>>> Jens,
>>>
>>> We haven't noticed any particularly large GC operations or even
>>> persistently high GC times.
>>>
>>> Mike
>>>
>>> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>>>
>>> Hi,
>>>
>>> Could it be garbage collection occurring on nodes that are more heavily
>>> loaded?
>>>
>>> Cheers,
>>> Jens
>>>
>>> On Sun, Jun 26, 2016 at 05:22, Mike Heffner <m...@librato.com> wrote:
>>>
>>> One thing to add: if we do a rolling restart of the ring, the timeouts
>>> disappear entirely for several hours and performance returns to normal.
>>> It's as if something is leaking over time, but we haven't seen any
>>> noticeable change in heap.
>>>
>>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com> wrote:
>>>
>>> Hi,
>>>
>>> We have a 12-node 2.2.6 ring running in AWS, single DC with RF=3, that
>>> is sitting at <25% CPU, doing mostly writes, and not showing any
>>> particularly long GC times/pauses. By all observed metrics the ring is
>>> healthy and performing well.
>>>
>>> However, we are noticing a pretty consistent number of connection
>>> timeouts coming from the messaging service between various pairs of
>>> nodes in the ring. The "Connection.TotalTimeouts" meter metric shows
>>> 100k's of timeouts per minute, usually between two pairs of nodes. It
>>> seems to occur for several hours at a time, then may stop or move to
>>> other pairs of nodes in the ring. The metric
>>> "Connection.SmallMessageDroppedTasks.<ip>" will also grow for one of
>>> the node pairs seen in the TotalTimeouts metric.
>>>
>>> Looking at the debug log typically shows a large number of messages
>>> like the following on one of the nodes:
>>>
>>>     StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
>>>
>>> We have cross-node timeouts enabled, but ntp is running on all nodes
>>> and no node appears to have time drift.
>>>
>>> The network appears to be fine between nodes, with iperf tests showing
>>> that we have a lot of headroom.
>>>
>>> Any thoughts on what to look for? Can we increase thread count/pool
>>> sizes for the messaging service?
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> --
>>> Mike Heffner <m...@librato.com>
>>> Librato, Inc.
>>>
>>> --
>>> Mike Heffner <m...@librato.com>
>>> Librato, Inc.
>>>
>>> --
>>> Jens Rantil
>>> Backend Developer @ Tink
>>> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
>>> For urgent matters you can reach me at +46-708-84 18 32.
>>>
>>> --
>>> Mike Heffner <m...@librato.com>
>>> Librato, Inc.
>>
>> --
>> Mike Heffner <m...@librato.com>
>> Librato, Inc.
>
> --
> Mike Heffner <m...@librato.com>
> Librato, Inc.
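PS: while testing I'm also planning to keep sampling the dropped-message counters Mike mentioned, so we can tell whether the timeouts only get briefer or disappear entirely. A rough per-node sampler I have in mind (the nodetool output format varies a bit between versions, so the sed pattern is an assumption):

    # print a timestamped snapshot every minute
    while true; do
      echo "== $(date -u) =="
      nodetool tpstats | sed -n '/Message type/,$p'   # dropped message counts by type
      nodetool netstats
      sleep 60
    done

As far as I know the "Connection.TotalTimeouts" and "Connection.SmallMessageDroppedTasks.<ip>" meters themselves aren't in the nodetool output; they're exposed under the org.apache.cassandra.metrics:type=Connection JMX domain, so any JMX client or metrics agent can graph them directly.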