Re: Ring connection timeouts with 2.2.6
Garo, No, we didn't notice any change in system load, just the expected spike in packet counts. Mike On Wed, Jul 20, 2016 at 3:49 PM, Juho Mäkinen <juho.maki...@gmail.com> wrote: > Just to pick this up: Did you see any system load spikes? I'm tracing a > problem on 2.2.7 where my cluster sees load spikes up to 20-30, when the > normal average load is around 3-4. So far I haven't found any good reason, > but I'm going to try otc_coalescing_strategy: disabled tomorrow. > > - Garo > > On Fri, Jul 15, 2016 at 6:16 PM, Mike Heffner <m...@librato.com> wrote: > >> Just to followup on this post with a couple of more data points: >> >> 1) >> >> We upgraded to 2.2.7 and did not see any change in behavior. >> >> 2) >> >> However, what *has* fixed this issue for us was disabling msg coalescing >> by setting: >> >> otc_coalescing_strategy: DISABLED >> >> We were using the default setting before (time horizon I believe). >> >> We see periodic timeouts on the ring (once every few hours), but they are >> brief and don't impact latency. With msg coalescing turned on we would see >> these timeouts persist consistently after an initial spike. My guess is >> that something in the coalescing logic is disturbed by the initial timeout >> spike which leads to dropping all / high-percentage of all subsequent >> traffic. >> >> We are planning to continue production use with msg coaleasing disabled >> for now and may run tests in our staging environments to identify where the >> coalescing is breaking this. >> >> Mike >> >> On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner <m...@librato.com> wrote: >> >>> Jeff, >>> >>> Thanks, yeah we updated to the 2.16.4 driver version from source. I >>> don't believe we've hit the bugs mentioned in earlier driver versions. 
>>> >>> Mike >>> >>> On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> >>> wrote: >>> >>>> AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver – >>>> depending on your instance types / hypervisor choice, you may want to >>>> ensure you’re not seeing that bug. >>>> >>>> >>>> >>>> *From: *Mike Heffner <m...@librato.com> >>>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org> >>>> *Date: *Friday, July 1, 2016 at 1:10 PM >>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org> >>>> *Cc: *Peter Norton <p...@librato.com> >>>> *Subject: *Re: Ring connection timeouts with 2.2.6 >>>> >>>> >>>> >>>> Jens, >>>> >>>> >>>> >>>> We haven't noticed any particular large GC operations or even >>>> persistently high GC times. >>>> >>>> >>>> >>>> Mike >>>> >>>> >>>> >>>> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.ran...@tink.se> >>>> wrote: >>>> >>>> Hi, >>>> >>>> Could it be garbage collection occurring on nodes that are more heavily >>>> loaded? >>>> >>>> Cheers, >>>> Jens >>>> >>>> >>>> >>>> Den sön 26 juni 2016 05:22Mike Heffner <m...@librato.com> skrev: >>>> >>>> One thing to add, if we do a rolling restart of the ring the timeouts >>>> disappear entirely for several hours and performance returns to normal. >>>> It's as if something is leaking over time, but we haven't seen any >>>> noticeable change in heap. >>>> >>>> >>>> >>>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com> >>>> wrote: >>>> >>>> Hi, >>>> >>>> >>>> >>>> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that >>>> is sitting at <25% CPU, doing mostly writes, and not showing any particular >>>> long GC times/pauses. By all observed metrics the ring is healthy and >>>> performing well. >>>> >>>> >>>> >>>> However, we are noticing a pretty consistent number of connection >>>> timeouts coming from the messaging service between various pairs of nodes >>>> in the ring. 
The "Connection.TotalTimeouts" meter metric shows 100k's of
Re: Ring connection timeouts with 2.2.6
Just to follow up on this post with a couple more data points: 1) We upgraded to 2.2.7 and did not see any change in behavior. 2) However, what *has* fixed this issue for us was disabling msg coalescing by setting: otc_coalescing_strategy: DISABLED We were using the default setting before (time horizon I believe). We see periodic timeouts on the ring (once every few hours), but they are brief and don't impact latency. With msg coalescing turned on we would see these timeouts persist consistently after an initial spike. My guess is that something in the coalescing logic is disturbed by the initial timeout spike, which leads to dropping all, or a high percentage of, subsequent traffic. We are planning to continue production use with msg coalescing disabled for now and may run tests in our staging environments to identify where the coalescing is breaking this. Mike On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner <m...@librato.com> wrote: > Jeff, > > Thanks, yeah we updated to the 2.16.4 driver version from source. I don't > believe we've hit the bugs mentioned in earlier driver versions. > > Mike > > On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> > wrote: > >> AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver – >> depending on your instance types / hypervisor choice, you may want to >> ensure you’re not seeing that bug. >> >> >> >> *From: *Mike Heffner <m...@librato.com> >> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org> >> *Date: *Friday, July 1, 2016 at 1:10 PM >> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org> >> *Cc: *Peter Norton <p...@librato.com> >> *Subject: *Re: Ring connection timeouts with 2.2.6 >> >> >> >> Jens, >> >> >> >> We haven't noticed any particular large GC operations or even >> persistently high GC times. 
>> >> >> >> Mike >> >> >> >> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.ran...@tink.se> wrote: >> >> Hi, >> >> Could it be garbage collection occurring on nodes that are more heavily >> loaded? >> >> Cheers, >> Jens >> >> >> >> Den sön 26 juni 2016 05:22Mike Heffner <m...@librato.com> skrev: >> >> One thing to add, if we do a rolling restart of the ring the timeouts >> disappear entirely for several hours and performance returns to normal. >> It's as if something is leaking over time, but we haven't seen any >> noticeable change in heap. >> >> >> >> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com> wrote: >> >> Hi, >> >> >> >> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is >> sitting at <25% CPU, doing mostly writes, and not showing any particular >> long GC times/pauses. By all observed metrics the ring is healthy and >> performing well. >> >> >> >> However, we are noticing a pretty consistent number of connection >> timeouts coming from the messaging service between various pairs of nodes >> in the ring. The "Connection.TotalTimeouts" meter metric show 100k's of >> timeouts per minute, usually between two pairs of nodes for several hours >> at a time. It seems to occur for several hours at a time, then may stop or >> move to other pairs of nodes in the ring. The metric >> "Connection.SmallMessageDroppedTasks." will also grow for one pair of >> the nodes in the TotalTimeouts metric. 
>> >> >> >> Looking at the debug log typically shows a large number of messages like >> the following on one of the nodes: >> >> >> >> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0) >> >> We have cross node timeouts enabled, but ntp is running on all nodes and >> no node appears to have time drift. >> >> >> >> The network appears to be fine between nodes, with iperf tests showing >> that we have a lot of headroom. >> >> >> >> Any thoughts on what to look for? Can we increase thread count/pool sizes >> for the messaging service? >> >> >> >> Thanks, >> >> >> >> Mike >> >> >> >> -- >> >> Mike Heffner <m...@librato.com> >> >> Librato, Inc. >> >> -- >> >> Jens Rantil >> Backend Developer @ Tink >> >> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden >> For urgent matters you can reach me at +46-708-84 18 32. -- Mike Heffner <m...@librato.com> Librato, Inc.
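The fix described in this thread is a one-line cassandra.yaml change (otc_coalescing_strategy: DISABLED) plus a rolling restart. As a minimal sketch of pre-checking that change across each node's config before restarting — the helper names and the idea of editing the file programmatically are assumptions of mine; only the setting itself comes from the thread:

```python
import re

def coalescing_disabled(yaml_text: str) -> bool:
    """True if otc_coalescing_strategy is explicitly set to DISABLED."""
    m = re.search(r'^\s*otc_coalescing_strategy:\s*(\S+)', yaml_text, re.MULTILINE)
    return bool(m) and m.group(1).upper() == 'DISABLED'

def set_coalescing_disabled(yaml_text: str) -> str:
    """Return yaml_text with coalescing disabled (edit the key, or append it)."""
    if re.search(r'^\s*otc_coalescing_strategy:', yaml_text, re.MULTILINE):
        return re.sub(r'^(\s*otc_coalescing_strategy:).*$', r'\1 DISABLED',
                      yaml_text, flags=re.MULTILINE)
    return yaml_text.rstrip('\n') + '\notc_coalescing_strategy: DISABLED\n'
```

Running this over every node's cassandra.yaml (and restarting nodes one at a time) mirrors the rollout described above.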
Re: Ring connection timeouts with 2.2.6
One thing to add: if we do a rolling restart of the ring, the timeouts
disappear entirely for several hours and performance returns to normal.
It's as if something is leaking over time, but we haven't seen any
noticeable change in heap.

On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com> wrote:

> Hi,
>
> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is
> sitting at <25% CPU, doing mostly writes, and not showing any particularly
> long GC times/pauses. By all observed metrics the ring is healthy and
> performing well.
>
> However, we are noticing a pretty consistent number of connection timeouts
> coming from the messaging service between various pairs of nodes in the
> ring. The "Connection.TotalTimeouts" meter metric shows 100k's of timeouts
> per minute, usually between two pairs of nodes. It occurs for several
> hours at a time, then may stop or move to other pairs of nodes in the
> ring. The metric "Connection.SmallMessageDroppedTasks." will also grow for
> one pair of the nodes in the TotalTimeouts metric.
>
> Looking at the debug log typically shows a large number of messages like
> the following on one of the nodes:
>
> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
>
> We have cross node timeouts enabled, but ntp is running on all nodes and
> no node appears to have time drift.
>
> The network appears to be fine between nodes, with iperf tests showing
> that we have a lot of headroom.
>
> Any thoughts on what to look for? Can we increase thread count/pool sizes
> for the messaging service?
>
> Thanks,
>
> Mike
>
> --
> Mike Heffner <m...@librato.com>
> Librato, Inc.

--
Mike Heffner <m...@librato.com>
Librato, Inc.
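Since a rolling restart clears the symptom for hours, one way to test the "something is leaking" theory is to trend a counter or gauge against time since restart: a persistently positive least-squares slope across several restart cycles supports it. A minimal sketch — how the samples are collected is left abstract and is my assumption, not something from the thread:

```python
def slope(samples):
    """Least-squares slope of (seconds_since_restart, metric_value) pairs.

    A consistently positive slope that resets to baseline after each
    restart suggests a resource that grows until timeouts begin.
    """
    n = len(samples)
    mt = sum(t for t, _ in samples) / n
    mv = sum(v for _, v in samples) / n
    den = sum((t - mt) ** 2 for t, _ in samples)
    num = sum((t - mt) * (v - mv) for t, v in samples)
    return num / den
```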
Ring connection timeouts with 2.2.6
Hi,

We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is
sitting at <25% CPU, doing mostly writes, and not showing any particularly
long GC times/pauses. By all observed metrics the ring is healthy and
performing well.

However, we are noticing a pretty consistent number of connection timeouts
coming from the messaging service between various pairs of nodes in the
ring. The "Connection.TotalTimeouts" meter metric shows 100k's of timeouts
per minute, usually between two pairs of nodes. It occurs for several hours
at a time, then may stop or move to other pairs of nodes in the ring. The
metric "Connection.SmallMessageDroppedTasks." will also grow for one pair
of the nodes in the TotalTimeouts metric.

Looking at the debug log typically shows a large number of messages like
the following on one of the nodes:

StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)

We have cross node timeouts enabled, but ntp is running on all nodes and no
node appears to have time drift.

The network appears to be fine between nodes, with iperf tests showing that
we have a lot of headroom.

Any thoughts on what to look for? Can we increase thread count/pool sizes
for the messaging service?

Thanks,

Mike

--
Mike Heffner <m...@librato.com>
Librato, Inc.
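Since the timeouts concentrate on particular node pairs, it helps to fold the per-endpoint timeout counters into undirected pairs and rank them. A small sketch, assuming you can export (source, peer, timeout_count) deltas from whatever scrapes the Connection.TotalTimeouts metric — the export mechanism is my assumption:

```python
from collections import Counter

def worst_pairs(samples, top=3):
    """Rank undirected node pairs by total cross-node timeouts.

    samples: iterable of (source_node, peer_node, timeout_count), e.g.
    per-minute deltas of each node's Connection.TotalTimeouts counters.
    """
    totals = Counter()
    for src, dst, count in samples:
        totals[tuple(sorted((src, dst)))] += count
    return totals.most_common(top)
```

Tracking the output over time would also show the pattern described above, where the affected pairs shift every few hours.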
Re: Consistent read timeouts for bursts of reads
Emils, We believe we've tracked it down to the following issue: https://issues.apache.org/jira/browse/CASSANDRA-11302, introduced in 2.1.5. We are running a build of 2.2.5 with that patch and so far have not seen any more timeouts. Mike On Fri, Mar 4, 2016 at 3:14 AM, Emīls Šolmanis <emils.solma...@gmail.com> wrote: > Mike, > > Is that where you've bisected it to having been introduced? > > I'll see what I can do, but doubt it, since we've long since upgraded prod > to 2.2.4 (and stage before that) and the tests I'm running were for a new > feature. > > > On Fri, 4 Mar 2016 03:54 Mike Heffner, <m...@librato.com> wrote: > >> Emils, >> >> I realize this may be a big downgrade, but are you timeouts reproducible >> under Cassandra 2.1.4? >> >> Mike >> >> On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis < >> emils.solma...@gmail.com> wrote: >> >>> Having had a read through the archives, I missed this at first, but this >>> seems to be *exactly* like what we're experiencing. >>> >>> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html >>> >>> Only difference is we're getting this for reads and using CQL, but the >>> behaviour is identical. >>> >>> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis <emils.solma...@gmail.com> >>> wrote: >>> >>>> Hello, >>>> >>>> We're having a problem with concurrent requests. It seems that whenever >>>> we try resolving more >>>> than ~ 15 queries at the same time, one or two get a read timeout and >>>> then succeed on a retry. >>>> >>>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on >>>> AWS. >>>> >>>> What we've found while investigating: >>>> >>>> * this is not db-wide. Trying the same pattern against another table >>>> everything works fine. >>>> * it fails 1 or 2 requests regardless of how many are executed in >>>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent >>>> requests and doesn't seem to scale up. >>>> * the problem is consistently reproducible. 
It happens both under >>>> heavier load and when just firing off a single batch of requests for >>>> testing. >>>> * tracing the faulty requests says everything is great. An example >>>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a >>>> * the only peculiar thing in the logs is there's no acknowledgement of >>>> the request being accepted by the server, as seen in >>>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a >>>> * there's nothing funny in the timed out Cassandra node's logs around >>>> that time as far as I can tell, not even in the debug logs. >>>> >>>> Any ideas about what might be causing this, pointers to server config >>>> options, or how else we might debug this would be much appreciated. >>>> >>>> Kind regards, >>>> Emils >>>> >>>> >> >> >> -- >> >> Mike Heffner <m...@librato.com> >> Librato, Inc. >> >> -- Mike Heffner <m...@librato.com> Librato, Inc.
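Until a build containing the CASSANDRA-11302 patch is deployed, the thread notes that affected reads succeed on retry, so a client-side retry with capped, jittered backoff can paper over the bug. A generic sketch, deliberately not tied to any particular driver's retry API:

```python
import random
import time

def with_retries(op, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Call op(); on TimeoutError, back off exponentially with jitter and
    retry, re-raising only after the final attempt fails."""
    for attempt in range(attempts):
        try:
            return op()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The injectable `sleep` parameter is just for testability; in production code the default is fine.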
Re: Consistent read timeouts for bursts of reads
Emils,

I realize this may be a big downgrade, but are your timeouts reproducible
under Cassandra 2.1.4?

Mike

On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis <emils.solma...@gmail.com>
wrote:

> Having had a read through the archives, I missed this at first, but this
> seems to be *exactly* like what we're experiencing.
>
> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>
> Only difference is we're getting this for reads and using CQL, but the
> behaviour is identical.
>
> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis <emils.solma...@gmail.com>
> wrote:
>
>> Hello,
>>
>> We're having a problem with concurrent requests. It seems that whenever
>> we try resolving more than ~ 15 queries at the same time, one or two get
>> a read timeout and then succeed on a retry.
>>
>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>> AWS.
>>
>> What we've found while investigating:
>>
>> * this is not db-wide. Trying the same pattern against another table,
>> everything works fine.
>> * it fails 1 or 2 requests regardless of how many are executed in
>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>> requests and doesn't seem to scale up.
>> * the problem is consistently reproducible. It happens both under
>> heavier load and when just firing off a single batch of requests for
>> testing.
>> * tracing the faulty requests says everything is great. An example
>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>> * the only peculiar thing in the logs is there's no acknowledgement of
>> the request being accepted by the server, as seen in
>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>> * there's nothing funny in the timed out Cassandra node's logs around
>> that time as far as I can tell, not even in the debug logs.
>>
>> Any ideas about what might be causing this, pointers to server config
>> options, or how else we might debug this would be much appreciated.
>>
>> Kind regards,
>> Emils

--
Mike Heffner <m...@librato.com>
Librato, Inc.
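Given that failures appear once roughly 15 queries are resolved at once, one mitigation while debugging is to cap the number of in-flight requests below that threshold. A sketch using a semaphore around an executor; the cap of 10 is a guess based on the numbers in the thread, not a recommendation from it:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_capped(ops, max_in_flight=10):
    """Run callables concurrently, but never more than max_in_flight at once.

    Results are returned in the same order as the input callables.
    """
    gate = threading.Semaphore(max_in_flight)

    def guarded(op):
        with gate:
            return op()

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [pool.submit(guarded, op) for op in ops]
        return [f.result() for f in futures]
```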
Re: Debugging write timeouts on Cassandra 2.2.5
Nate, So we have run several install tests, bisecting the 2.1.x release line, and we believe that the regression was introduced in version 2.1.5. This is the first release that clearly hits the timeout for us. It looks like quite a large release, so our next step will likely be bisecting the major commits to see if we can narrow it down: https://github.com/apache/cassandra/blob/3c0a337ebc90b0d99349d0aa152c92b5b3494d8c/CHANGES.txt. Obviously, any suggestions on potential suspects appreciated. These are the memtable settings we've configured diff from the defaults during our testing: memtable_allocation_type: offheap_objects memtable_flush_writers: 8 Cheers, Mike On Fri, Feb 19, 2016 at 1:46 PM, Nate McCall <n...@thelastpickle.com> wrote: > The biggest change which *might* explain your behavior has to do with the > changes in memtable flushing between 2.0 and 2.1: > https://issues.apache.org/jira/browse/CASSANDRA-5549 > > However, the tpstats you posted shows no dropped mutations which would > make me more certain of this as the cause. > > What values do you have right now for each of these (my recommendations > for each on a c4.2xl with stock cassandra-env.sh are in parenthesis): > > - memtable_flush_writers (2) > - memtable_heap_space_in_mb (2048) > - memtable_offheap_space_in_mb (2048) > - memtable_cleanup_threshold (0.11) > - memtable_allocation_type (offheap_objects) > > The biggest win IMO will be moving to offheap_objects. By default, > everything is on heap. Regardless, spending some time tuning these for your > workload will pay off. > > You may also want to be explicit about > > - native_transport_max_concurrent_connections > - native_transport_max_concurrent_connections_per_ip > > Depending on the driver, these may now be allowing 32k streams per > connection(!) 
as detailed in v3 of the native protocol: > > https://github.com/apache/cassandra/blob/cassandra-2.1/doc/native_protocol_v3.spec#L130-L152 > > > > On Fri, Feb 19, 2016 at 8:48 AM, Mike Heffner <m...@librato.com> wrote: > >> Anuj, >> >> So we originally started testing with Java8 + G1, however we were able to >> reproduce the same results with the default CMS settings that ship in the >> cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses >> during the runs. >> >> Query pattern during our testing was 100% writes, batching (via Thrift >> mostly) to 5 tables, between 6-1500 rows per batch. >> >> Mike >> >> On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> >> wrote: >> >>> Whats the GC overhead? Can you your share your GC collector and settings >>> ? >>> >>> >>> Whats your query pattern? Do you use secondary indexes, batches, in >>> clause etc? >>> >>> >>> Anuj >>> >>> >>> Sent from Yahoo Mail on Android >>> <https://overview.mail.yahoo.com/mobile/?.src=Android> >>> >>> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner >>> <m...@librato.com> wrote: >>> Alain, >>> >>> Thanks for the suggestions. >>> >>> Sure, tpstats are here: >>> https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the >>> metrics across the ring, there were no blocked tasks nor dropped messages. >>> >>> Iowait metrics look fine, so it doesn't appear to be blocking on disk. >>> Similarly, there are no long GC pauses. >>> >>> We haven't noticed latency on any particular table higher than others or >>> correlated around the occurrence of a timeout. We have noticed with further >>> testing that running cassandra-stress against the ring, while our workload >>> is writing to the same ring, will incur similar 10 second timeouts. If our >>> workload is not writing to the ring, cassandra stress will run without >>> hitting timeouts. 
This seems to imply that our workload pattern is causing >>> something to block cluster-wide, since the stress tool writes to a >>> different keyspace then our workload. >>> >>> I mentioned in another reply that we've tracked it to something between >>> 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was >>> introduced in. >>> >>> Cheers, >>> >>> Mike >>> >>> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com> >>> wrote: >>> >>>> Hi Mike, >>>> >>>> What about the output of tpstats ? I imagine you have dropped messages >>>> there. Any blocked threads ? Could you past
Re: Debugging write timeouts on Cassandra 2.2.5
Anuj, So we originally started testing with Java8 + G1, however we were able to reproduce the same results with the default CMS settings that ship in the cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses during the runs. Query pattern during our testing was 100% writes, batching (via Thrift mostly) to 5 tables, between 6-1500 rows per batch. Mike On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: > What's the GC overhead? Can you share your GC collector and settings? > > > What's your query pattern? Do you use secondary indexes, batches, in clause > etc? > > > Anuj > > > Sent from Yahoo Mail on Android > > On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner > <m...@librato.com> wrote: > Alain, > > Thanks for the suggestions. > > Sure, tpstats are here: > https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the > metrics across the ring, there were no blocked tasks nor dropped messages. > > Iowait metrics look fine, so it doesn't appear to be blocking on disk. > Similarly, there are no long GC pauses. > > We haven't noticed latency on any particular table higher than others or > correlated around the occurrence of a timeout. We have noticed with further > testing that running cassandra-stress against the ring, while our workload > is writing to the same ring, will incur similar 10 second timeouts. If our > workload is not writing to the ring, cassandra-stress will run without > hitting timeouts. This seems to imply that our workload pattern is causing > something to block cluster-wide, since the stress tool writes to a > different keyspace than our workload. > > I mentioned in another reply that we've tracked it to something between > 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was > introduced in. 
> > Cheers, > > Mike > > On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com> > wrote: > >> Hi Mike, >> >> What about the output of tpstats ? I imagine you have dropped messages >> there. Any blocked threads ? Could you paste this output here ? >> >> May this be due to some network hiccup to access the disks as they are >> EBS ? Can you think of anyway of checking this ? Do you have a lot of GC >> logs, how long are the pauses (use something like: grep -i 'GCInspector' >> /var/log/cassandra/system.log) ? >> >> Something else you could check are local_writes stats to see if only one >> table if affected or this is keyspace / cluster wide. You can use metrics >> exposed by cassandra or if you have no dashboards I believe a: 'nodetool >> cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea >> of local latencies. >> >> Those are just things I would check, I have not a clue on what is >> happening here, hope this will help. >> >> C*heers, >> - >> Alain Rodriguez >> France >> >> The Last Pickle >> http://www.thelastpickle.com >> >> 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>: >> >>> Jaydeep, >>> >>> No, we don't use any light weight transactions. >>> >>> Mike >>> >>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia < >>> chovatia.jayd...@gmail.com> wrote: >>> >>>> Are you guys using light weight transactions in your write path? >>>> >>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat < >>>> fabrice.faco...@gmail.com> wrote: >>>> >>>>> Are your commitlog and data on the same disk ? If yes, you should put >>>>> commitlogs on a separate disk which don't have a lot of IO. >>>>> >>>>> Others IO may have great impact impact on your commitlog writing and >>>>> it may even block. 
>>>>> >>>>> An example of impact IO may have, even for Async writes: >>>>> >>>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic >>>>> >>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>: >>>>> > Jeff, >>>>> > >>>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS. >>>>> > >>>>> > Mike >>>>> > >>>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa < >>>>> jeff.ji...@crowdstrike.com> >>>>> > wrote: >>>>> >
Re: Debugging write timeouts on Cassandra 2.2.5
Alain, Thanks for the suggestions. Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages. Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses. We haven't noticed latency on any particular table higher than others or correlated around the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra-stress will run without hitting timeouts. This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace than our workload. I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was introduced in. Cheers, Mike On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote: > Hi Mike, > > What about the output of tpstats? I imagine you have dropped messages > there. Any blocked threads? Could you paste this output here? > > May this be due to some network hiccup to access the disks as they are EBS? > Can you think of any way of checking this? Do you have a lot of GC logs, > how long are the pauses (use something like: grep -i 'GCInspector' > /var/log/cassandra/system.log)? > > Something else you could check are local_writes stats to see if only one > table is affected or this is keyspace / cluster wide. You can use metrics > exposed by cassandra or if you have no dashboards I believe a: 'nodetool > cfstats | grep -e 'Table:' -e 'Local'' should give you a rough idea > of local latencies. > > Those are just things I would check, I have not a clue on what is > happening here, hope this will help. 
> > C*heers, > - > Alain Rodriguez > France > > The Last Pickle > http://www.thelastpickle.com > > 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>: > >> Jaydeep, >> >> No, we don't use any light weight transactions. >> >> Mike >> >> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia < >> chovatia.jayd...@gmail.com> wrote: >> >>> Are you guys using light weight transactions in your write path? >>> >>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat < >>> fabrice.faco...@gmail.com> wrote: >>> >>>> Are your commitlog and data on the same disk ? If yes, you should put >>>> commitlogs on a separate disk which don't have a lot of IO. >>>> >>>> Others IO may have great impact impact on your commitlog writing and >>>> it may even block. >>>> >>>> An example of impact IO may have, even for Async writes: >>>> >>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic >>>> >>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>: >>>> > Jeff, >>>> > >>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS. >>>> > >>>> > Mike >>>> > >>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa < >>>> jeff.ji...@crowdstrike.com> >>>> > wrote: >>>> >> >>>> >> What disk size are you using? >>>> >> >>>> >> >>>> >> >>>> >> From: Mike Heffner >>>> >> Reply-To: "user@cassandra.apache.org" >>>> >> Date: Wednesday, February 10, 2016 at 2:24 PM >>>> >> To: "user@cassandra.apache.org" >>>> >> Cc: Peter Norton >>>> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >>>> >> >>>> >> Paulo, >>>> >> >>>> >> Thanks for the suggestion, we ran some tests against CMS and saw the >>>> same >>>> >> timeouts. On that note though, we are going to try doubling the >>>> instance >>>> >> sizes and testing with double the heap (even though current usage is >>>> low). 
>>>> >> >>>> >> Mike >>>> >> >>>> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta < >>>> pauloricard...@gmail.com> >>>> >> wrote: >>>> >>> >>>> >>> Are you using the same GC se
Re: Debugging write timeouts on Cassandra 2.2.5
Following up from our earlier post... We have continued to do exhaustive testing and measuring of the numerous hardware and configuration variables here. What we have uncovered is that on identical hardware (including the configuration we run in production), something between versions 2.0.17 and 2.1.13 introduced this write timeout for our workload. We still aren't any closer to identifying the what or why, but it is easily reproduced using our workload when we bump to the 2.1.x release line. At the moment we are going to focus on hardening this new hardware configuration using the 2.0.17 release and roll it out internally to some of our production rings. We also want to bisect the 2.1.x release line to find if there was a particular point release that introduced the timeout. If anyone has suggestions for particular changes to look out for we'd be happy to focus a test on that earlier. Thanks, Mike On Wed, Feb 10, 2016 at 2:51 PM, Mike Heffner <m...@librato.com> wrote: > Hi all, > > We've recently embarked on a project to update our Cassandra > infrastructure running on EC2. We are long time users of 2.0.x and are > testing out a move to version 2.2.5 running on VPC with EBS. Our test setup > is a 3 node, RF=3 cluster supporting a small write load (mirror of our > staging load). > > We are writing at QUORUM and while p95's look good compared to our staging > 2.0.x cluster, we are seeing frequent write operations that time out at the > max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < > 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle > JDK 8u60 and we're using G1GC and any GC pauses are less than 500ms. > > We run on c4.2xl instances with GP2 EBS attached storage for data and > commitlog directories. The nodes are using EC2 enhanced networking and have > the latest Intel network driver module. We are running on HVM instances > using Ubuntu 14.04.2. > > Our schema is 5 tables, all with COMPACT STORAGE. 
Each table is similar to > the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a > > This is our cassandra.yaml: > https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml > > Like I mentioned we use 8u60 with G1GC and have used many of the GC > settings in Al Tobey's tuning guide. This is our upstart config with JVM > and other CPU settings: > https://gist.github.com/mheffner/dc44613620b25c4fa46d > > We've used several of the sysctl settings from Al's guide as well: > https://gist.github.com/mheffner/ea40d58f58a517028152 > > Our client application is able to write using either Thrift batches using > Asytanax driver or CQL async INSERT's using the Datastax Java driver. > > For testing against Thrift (our legacy infra uses this) we write batches > of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is > around 45ms but our maximum (p100) sits less than 150ms except when it > periodically spikes to the full 10seconds. > > Testing the same write path using CQL writes instead demonstrates similar > behavior. Low p99s except for periodic full timeouts. We enabled tracing > for several operations but were unable to get a trace that completed > successfully -- Cassandra started logging many messages as: > > INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages > were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross > node timeout > > And all the traces contained rows with a "null" source_elapsed row: > https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out > > > We've exhausted as many configuration option permutations that we can > think of. This cluster does not appear to be under any significant load and > latencies seem to largely fall in two bands: low normal or max timeout. > This seems to imply that something is getting stuck and timing out at the > max write timeout. > > Any suggestions on what to look for? 
We had debug enabled for awhile but > we didn't see any msg that pointed to something obvious. Happy to provide > any more information that may help. > > We are pretty much at the point of sprinkling debug around the code to > track down what could be blocking. > > > Thanks, > > Mike > > -- > > Mike Heffner <m...@librato.com> > Librato, Inc. > > -- Mike Heffner <m...@librato.com> Librato, Inc.
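The bisection over the 2.1.x point releases described above can be automated. A minimal sketch, assuming the regression persists once introduced (the `shows_timeouts` callback is hypothetical — in practice it would install the given release, replay the write workload, and report whether any write hits the 10-second timeout):

```python
# Sketch: find the first 2.1.x point release that reproduces the write
# timeouts. Assumes releases are ordered and the regression, once
# introduced, is present in every later release.

def first_bad_release(releases, shows_timeouts):
    """Binary-search an ordered list of releases for the first bad one."""
    lo, hi = 0, len(releases) - 1
    first_bad = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if shows_timeouts(releases[mid]):
            first_bad = releases[mid]
            hi = mid - 1   # a bad release; look for an earlier one
        else:
            lo = mid + 1   # a good release; regression appeared later
    return first_bad

# Illustrative subset of the 2.1.x line; not a claim about which
# release actually introduced the regression.
releases = ["2.1.0", "2.1.3", "2.1.6", "2.1.9", "2.1.13"]
```

Each probe costs one full workload replay, so a binary search over the point releases needs only a handful of runs instead of one per release.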
Re: Debugging write timeouts on Cassandra 2.2.5
Jaydeep, No, we don't use any light weight transactions. Mike On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia < chovatia.jayd...@gmail.com> wrote: > Are you guys using light weight transactions in your write path? > > On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat < > fabrice.faco...@gmail.com> wrote: > >> Are your commitlog and data on the same disk ? If yes, you should put >> commitlogs on a separate disk which don't have a lot of IO. >> >> Others IO may have great impact impact on your commitlog writing and >> it may even block. >> >> An example of impact IO may have, even for Async writes: >> >> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic >> >> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>: >> > Jeff, >> > >> > We have both commitlog and data on a 4TB EBS with 10k IOPS. >> > >> > Mike >> > >> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com >> > >> > wrote: >> >> >> >> What disk size are you using? >> >> >> >> >> >> >> >> From: Mike Heffner >> >> Reply-To: "user@cassandra.apache.org" >> >> Date: Wednesday, February 10, 2016 at 2:24 PM >> >> To: "user@cassandra.apache.org" >> >> Cc: Peter Norton >> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5 >> >> >> >> Paulo, >> >> >> >> Thanks for the suggestion, we ran some tests against CMS and saw the >> same >> >> timeouts. On that note though, we are going to try doubling the >> instance >> >> sizes and testing with double the heap (even though current usage is >> low). >> >> >> >> Mike >> >> >> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricard...@gmail.com >> > >> >> wrote: >> >>> >> >>> Are you using the same GC settings as the staging 2.0 cluster? If not, >> >>> could you try using the default GC settings (CMS) and see if that >> changes >> >>> anything? 

This is just a wild guess, but there were reports before of >> >>> G1-caused instabilities with small heap sizes (< 16GB - see >> CASSANDRA-10403 >> >>> for more context). Please ignore if you already tried reverting back >> to CMS. >> >>> >> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>: >> >>>> >> >>>> Hi all, >> >>>> >> >>>> We've recently embarked on a project to update our Cassandra >> >>>> infrastructure running on EC2. We are long time users of 2.0.x and >> are >> >>>> testing out a move to version 2.2.5 running on VPC with EBS. Our >> test setup >> >>>> is a 3 node, RF=3 cluster supporting a small write load (mirror of >> our >> >>>> staging load). >> >>>> >> >>>> We are writing at QUORUM and while p95's look good compared to our >> >>>> staging 2.0.x cluster, we are seeing frequent write operations that >> time out >> >>>> at the max write_request_timeout_in_ms (10 seconds). CPU across the >> cluster >> >>>> is < 10% and EBS write load is < 100 IOPS. Cassandra is running with >> the >> >>>> Oracle JDK 8u60 and we're using G1GC and any GC pauses are less than >> 500ms. >> >>>> >> >>>> We run on c4.2xl instances with GP2 EBS attached storage for data and >> >>>> commitlog directories. The nodes are using EC2 enhanced networking >> and have >> >>>> the latest Intel network driver module. We are running on HVM >> instances >> >>>> using Ubuntu 14.04.2. >> >>>> >> >>>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is >> similar >> >>>> to the definition here: >> >>>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a >> >>>> >> >>>> This is our cassandra.yaml: >> >>>> >> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml >> >>>> >> >>>> Like I mentioned we use 8u60 with G1GC and have used many of the GC >> >>>> settings in Al Tobey's tuning guide. This is our upstart config with
Debugging write timeouts on Cassandra 2.2.5
Hi all, We've recently embarked on a project to update our Cassandra infrastructure running on EC2. We are long time users of 2.0.x and are testing out a move to version 2.2.5 running on VPC with EBS. Our test setup is a 3 node, RF=3 cluster supporting a small write load (mirror of our staging load). We are writing at QUORUM and while p95's look good compared to our staging 2.0.x cluster, we are seeing frequent write operations that time out at the max write_request_timeout_in_ms (10 seconds). CPU across the cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less than 500ms. We run on c4.2xl instances with GP2 EBS attached storage for data and commitlog directories. The nodes are using EC2 enhanced networking and have the latest Intel network driver module. We are running on HVM instances using Ubuntu 14.04.2. Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a This is our cassandra.yaml: https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml Like I mentioned we use 8u60 with G1GC and have used many of the GC settings in Al Tobey's tuning guide. This is our upstart config with JVM and other CPU settings: https://gist.github.com/mheffner/dc44613620b25c4fa46d We've used several of the sysctl settings from Al's guide as well: https://gist.github.com/mheffner/ea40d58f58a517028152 Our client application is able to write using either Thrift batches using the Astyanax driver or CQL async INSERTs using the DataStax Java driver. For testing against Thrift (our legacy infra uses this) we write batches of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is around 45ms but our maximum (p100) sits less than 150ms except when it periodically spikes to the full 10 seconds. Testing the same write path using CQL writes instead demonstrates similar behavior. 
Low p99s except for periodic full timeouts. We enabled tracing for several operations but were unable to get a trace that completed successfully -- Cassandra started logging many messages like: INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross node timeout And all the traces contained rows with a "null" source_elapsed row: https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out We've exhausted as many configuration option permutations as we can think of. This cluster does not appear to be under any significant load, and latencies seem to largely fall into two bands: low normal or max timeout. This seems to imply that something is getting stuck and timing out at the max write timeout. Any suggestions on what to look for? We had debug enabled for a while but we didn't see any message that pointed to something obvious. Happy to provide any more information that may help. We are pretty much at the point of sprinkling debug around the code to track down what could be blocking. Thanks, Mike -- Mike Heffner <m...@librato.com> Librato, Inc.
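For reference, the 10-second ceiling described above is governed by a single cassandra.yaml setting; the value below mirrors what the post describes (the stock default in 2.2.x is 2000 ms):

```yaml
# cassandra.yaml (excerpt) -- coordinator write timeout.
# The cluster above runs with 10000 ms, so a "max timeout" write has
# stalled for the full 10 seconds before the coordinator gives up.
write_request_timeout_in_ms: 10000
```

That the slow writes land exactly at this ceiling, rather than spreading out below it, is what suggests something is blocking indefinitely and only the timeout ends it.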
Re: Debugging write timeouts on Cassandra 2.2.5
Paulo, Thanks for the suggestion, we ran some tests against CMS and saw the same timeouts. On that note though, we are going to try doubling the instance sizes and testing with double the heap (even though current usage is low). Mike On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricard...@gmail.com> wrote: > Are you using the same GC settings as the staging 2.0 cluster? If not, > could you try using the default GC settings (CMS) and see if that changes > anything? This is just a wild guess, but there were reports before of > G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 > for more context). Please ignore if you already tried reverting back to CMS. > > 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>: > >> Hi all, >> >> We've recently embarked on a project to update our Cassandra >> infrastructure running on EC2. We are long time users of 2.0.x and are >> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup >> is a 3 node, RF=3 cluster supporting a small write load (mirror of our >> staging load). >> >> We are writing at QUORUM and while p95's look good compared to our >> staging 2.0.x cluster, we are seeing frequent write operations that time >> out at the max write_request_timeout_in_ms (10 seconds). CPU across the >> cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running >> with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less >> than 500ms. >> >> We run on c4.2xl instances with GP2 EBS attached storage for data and >> commitlog directories. The nodes are using EC2 enhanced networking and have >> the latest Intel network driver module. We are running on HVM instances >> using Ubuntu 14.04.2. >> >> Our schema is 5 tables, all with COMPACT STORAGE. 
Each table is similar >> to the definition here: >> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a >> >> This is our cassandra.yaml: >> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml >> >> Like I mentioned we use 8u60 with G1GC and have used many of the GC >> settings in Al Tobey's tuning guide. This is our upstart config with JVM >> and other CPU settings: >> https://gist.github.com/mheffner/dc44613620b25c4fa46d >> >> We've used several of the sysctl settings from Al's guide as well: >> https://gist.github.com/mheffner/ea40d58f58a517028152 >> >> Our client application is able to write using either Thrift batches using >> Asytanax driver or CQL async INSERT's using the Datastax Java driver. >> >> For testing against Thrift (our legacy infra uses this) we write batches >> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is >> around 45ms but our maximum (p100) sits less than 150ms except when it >> periodically spikes to the full 10seconds. >> >> Testing the same write path using CQL writes instead demonstrates similar >> behavior. Low p99s except for periodic full timeouts. We enabled tracing >> for several operations but were unable to get a trace that completed >> successfully -- Cassandra started logging many messages as: >> >> INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages >> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross >> node timeout >> >> And all the traces contained rows with a "null" source_elapsed row: >> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out >> >> >> We've exhausted as many configuration option permutations that we can >> think of. This cluster does not appear to be under any significant load and >> latencies seem to largely fall in two bands: low normal or max timeout. >> This seems to imply that something is getting stuck and timing out at the >> max write timeout. 
>> >> Any suggestions on what to look for? We had debug enabled for awhile but >> we didn't see any msg that pointed to something obvious. Happy to provide >> any more information that may help. >> >> We are pretty much at the point of sprinkling debug around the code to >> track down what could be blocking. >> >> >> Thanks, >> >> Mike >> >> -- >> >> Mike Heffner <m...@librato.com> >> Librato, Inc. >> >> > -- Mike Heffner <m...@librato.com> Librato, Inc.
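The dropped-message log lines quoted above are easy to tally mechanically. A small sketch, with the regex inferred from the one sample line in the thread (not from the Cassandra source, so treat the pattern as an assumption):

```python
import re

# Pattern for MessagingService "messages were dropped" reports, matching
# the line format quoted in the thread.
DROP_RE = re.compile(
    r"(?P<kind>\S+) messages were dropped in last (?P<window_ms>\d+) ms: "
    r"(?P<internal>\d+) for internal timeout and "
    r"(?P<cross_node>\d+) for cross node timeout"
)

def parse_drops(line):
    """Return (message kind, window ms, internal drops, cross-node drops),
    or None if the line is not a drop report."""
    m = DROP_RE.search(line)
    if not m:
        return None
    return (m.group("kind"), int(m.group("window_ms")),
            int(m.group("internal")), int(m.group("cross_node")))
```

Grepping a node's system.log through this and summing per message kind makes it obvious whether drops are internal (the node itself stalled) or cross-node (a peer's messages arrived too late).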
Re: Debugging write timeouts on Cassandra 2.2.5
Jeff, We have both commitlog and data on a 4TB EBS with 10k IOPS. Mike On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote: > What disk size are you using? > > > > From: Mike Heffner > Reply-To: "user@cassandra.apache.org" > Date: Wednesday, February 10, 2016 at 2:24 PM > To: "user@cassandra.apache.org" > Cc: Peter Norton > Subject: Re: Debugging write timeouts on Cassandra 2.2.5 > > Paulo, > > Thanks for the suggestion, we ran some tests against CMS and saw the same > timeouts. On that note though, we are going to try doubling the instance > sizes and testing with double the heap (even though current usage is low). > > Mike > > On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricard...@gmail.com> > wrote: > >> Are you using the same GC settings as the staging 2.0 cluster? If not, >> could you try using the default GC settings (CMS) and see if that changes >> anything? This is just a wild guess, but there were reports before of >> G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403 >> for more context). Please ignore if you already tried reverting back to CMS. >> >> 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>: >> >>> Hi all, >>> >>> We've recently embarked on a project to update our Cassandra >>> infrastructure running on EC2. We are long time users of 2.0.x and are >>> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup >>> is a 3 node, RF=3 cluster supporting a small write load (mirror of our >>> staging load). >>> >>> We are writing at QUORUM and while p95's look good compared to our >>> staging 2.0.x cluster, we are seeing frequent write operations that time >>> out at the max write_request_timeout_in_ms (10 seconds). CPU across the >>> cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running >>> with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less >>> than 500ms. 
>>> >>> We run on c4.2xl instances with GP2 EBS attached storage for data and >>> commitlog directories. The nodes are using EC2 enhanced networking and have >>> the latest Intel network driver module. We are running on HVM instances >>> using Ubuntu 14.04.2. >>> >>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar >>> to the definition here: >>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a >>> >>> This is our cassandra.yaml: >>> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml >>> >>> Like I mentioned we use 8u60 with G1GC and have used many of the GC >>> settings in Al Tobey's tuning guide. This is our upstart config with JVM >>> and other CPU settings: >>> https://gist.github.com/mheffner/dc44613620b25c4fa46d >>> >>> We've used several of the sysctl settings from Al's guide as well: >>> https://gist.github.com/mheffner/ea40d58f58a517028152 >>> >>> Our client application is able to write using either Thrift batches >>> using Asytanax driver or CQL async INSERT's using the Datastax Java driver. >>> >>> For testing against Thrift (our legacy infra uses this) we write batches >>> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is >>> around 45ms but our maximum (p100) sits less than 150ms except when it >>> periodically spikes to the full 10seconds. >>> >>> Testing the same write path using CQL writes instead demonstrates >>> similar behavior. Low p99s except for periodic full timeouts. 
We enabled >>> tracing for several operations but were unable to get a trace that >>> completed successfully -- Cassandra started logging many messages as: >>> >>> INFO [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages >>> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross >>> node timeout >>> >>> And all the traces contained rows with a "null" source_elapsed row: >>> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out >>> >>> >>> We've exhausted as many configuration option permutations that we can >>> think of. This cluster does not appear to be under any significant load and >>> latencies seem to largely fall in two bands: low normal or max timeout. >>> This seems to imply that something is getting stuck and timing out at the >>> max write timeout. >>> >>> Any suggestions on what to look for? We had debug enabled for awhile but >>> we didn't see any msg that pointed to something obvious. Happy to provide >>> any more information that may help. >>> >>> We are pretty much at the point of sprinkling debug around the code to >>> track down what could be blocking. >>> >>> >>> Thanks, >>> >>> Mike >>> >>> -- >>> >>> Mike Heffner <m...@librato.com> >>> Librato, Inc. >>> >>> >> > > > -- > > Mike Heffner <m...@librato.com> > Librato, Inc. > > -- Mike Heffner <m...@librato.com> Librato, Inc.
Re: Significant drop in storage load after 2.1.6-2.1.8 upgrade
Nate, Thanks. I dug through the changes a bit more and I believe my original observation may have been due to: https://github.com/krummas/cassandra/commit/fbc47e3b950949a8aa191bc7e91eb6cb396fe6a8 from: https://issues.apache.org/jira/browse/CASSANDRA-9572 I had originally passed over it because we are not using DTCS, but it matches since the upgrade appeared to only drop fully expired sstables. Mike On Sat, Jul 18, 2015 at 3:40 PM, Nate McCall n...@thelastpickle.com wrote: Perhaps https://issues.apache.org/jira/browse/CASSANDRA-9592 got compactions moving forward for you? This would explain the drop. However, the discussion on https://issues.apache.org/jira/browse/CASSANDRA-9683 seems to be similar to what you saw and that is currently being investigated. On Fri, Jul 17, 2015 at 10:24 AM, Mike Heffner m...@librato.com wrote: Hi all, I've been upgrading several of our rings from 2.1.6 to 2.1.8 and I've noticed that after the upgrade our storage load drops significantly (I've seen up to an 80% drop). I believe most of the data that is dropped is tombstoned (via TTL expiration) and I haven't detected any data loss yet. However, can someone point me to what changed between 2.1.6 and 2.1.8 that would lead to such a significant drop in tombstoned data? Looking at the changelog there's nothing that jumps out at me. 
This is a CF definition from one of the CFs that had a significant drop:

describe measures_mid_1;

CREATE TABLE Metrics.measures_mid_1 (
    key blob,
    c1 int,
    c2 blob,
    c3 blob,
    PRIMARY KEY (key, c1, c2)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (c1 ASC, c2 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Thanks, Mike -- Mike Heffner m...@librato.com Librato, Inc. -- - Nate McCall Austin, TX @zznate Co-Founder Sr. Technical Consultant Apache Cassandra Consulting http://www.thelastpickle.com -- Mike Heffner m...@librato.com Librato, Inc.
Significant drop in storage load after 2.1.6-2.1.8 upgrade
Hi all, I've been upgrading several of our rings from 2.1.6 to 2.1.8 and I've noticed that after the upgrade our storage load drops significantly (I've seen up to an 80% drop). I believe most of the data that is dropped is tombstoned (via TTL expiration) and I haven't detected any data loss yet. However, can someone point me to what changed between 2.1.6 and 2.1.8 that would lead to such a significant drop in tombstoned data? Looking at the changelog there's nothing that jumps out at me. This is a CF definition from one of the CFs that had a significant drop:

describe measures_mid_1;

CREATE TABLE Metrics.measures_mid_1 (
    key blob,
    c1 int,
    c2 blob,
    c3 blob,
    PRIMARY KEY (key, c1, c2)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (c1 ASC, c2 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Thanks, Mike -- Mike Heffner m...@librato.com Librato, Inc.
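The schema above uses per-write TTLs with gc_grace_seconds = 0, which means an expired cell becomes purgeable as soon as its TTL lapses. A sketch of the purge condition (simplified — real compaction additionally requires that the sstable's expired data not be shadowed-by or overlap live data for the same partition):

```python
def is_purgeable(write_time_s, ttl_s, gc_grace_s, now_s):
    """A TTL'd cell expires at write_time + ttl and may be dropped once
    gc_grace has also elapsed. With gc_grace_seconds = 0, as in the
    schema above, expiry and purgeability coincide."""
    expires_at = write_time_s + ttl_s
    return now_s >= expires_at + gc_grace_s
```

This is why a change that merely gets fully-expired sstables dropped more eagerly (like the DTCS-adjacent fix cited in the reply) can shed a large fraction of storage load without any data loss.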
Re: How to column slice with CQL + 1.2
Tyler, Cool, yes I was actually trying to solve that exact problem of paginating with LIMIT when it ends up slicing in the middle of a set of composite columns. (though sounds like automatic ResultSet paging in 2.0.x alleviates that need). So to do composite column slicing in 1.2.x the answer is to stick with Thrift? Mike On Thu, Jul 17, 2014 at 8:27 PM, Tyler Hobbs ty...@datastax.com wrote: For this type of query, you really want the tuple notation introduced in 2.0.6 (https://issues.apache.org/jira/browse/CASSANDRA-4851): SELECT * FROM CF WHERE key='X' AND (column1, column2, column3) > (1, 3, 4) AND (column1) <= (2) On Thu, Jul 17, 2014 at 6:01 PM, Mike Heffner m...@librato.com wrote: Michael, So if I switch to: SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND column3>4 That doesn't include rows where column1=2, which breaks the original slice query. Maybe a better way to put it, I would like: SELECT * FROM CF WHERE key='X' AND column1>=1 AND column2=3 AND column3>4 AND column1<=2; but that is rejected with: Bad Request: PRIMARY KEY part column2 cannot be restricted (preceding part column1 is either not restricted or by a non-EQ relation) Mike On Thu, Jul 17, 2014 at 6:37 PM, Michael Dykman mdyk...@gmail.com wrote: The last term in this query is redundant. Any time column1 = 1, we may reasonably expect that it is also <= 2, as that's where 1 is found. If you remove the last term, you eliminate the error and lose none of the selection logic. SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND column3>4 AND column1<=2; On Thu, Jul 17, 2014 at 6:23 PM, Mike Heffner m...@librato.com wrote: What is the proper way to perform a column slice using CQL with 1.2? I have a CF with a primary key X and 3 composite columns (A, B, C). 
I'd like to find records at: key=X columns >= (A=1, B=3, C=4) AND columns <= (A=2) The Query: SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND column3>4 AND column1<=2; fails with: DoGetMeasures: column1 cannot be restricted by both an equal and an inequal relation This is against Cassandra 1.2.16. What is the proper way to perform this query? Cheers, Mike -- Mike Heffner m...@librato.com Librato, Inc. -- - michael dykman - mdyk...@gmail.com May the Source be with you. -- Mike Heffner m...@librato.com Librato, Inc. -- Tyler Hobbs DataStax http://datastax.com/ -- Mike Heffner m...@librato.com Librato, Inc.
Re: How to column slice with CQL + 1.2
Michael, So if I switch to: SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND column3>4 That doesn't include rows where column1=2, which breaks the original slice query. Maybe a better way to put it, I would like: SELECT * FROM CF WHERE key='X' AND column1>=1 AND column2=3 AND column3>4 AND column1<=2; but that is rejected with: Bad Request: PRIMARY KEY part column2 cannot be restricted (preceding part column1 is either not restricted or by a non-EQ relation) Mike On Thu, Jul 17, 2014 at 6:37 PM, Michael Dykman mdyk...@gmail.com wrote: The last term in this query is redundant. Any time column1 = 1, we may reasonably expect that it is also <= 2, as that's where 1 is found. If you remove the last term, you eliminate the error and lose none of the selection logic. SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND column3>4 AND column1<=2; On Thu, Jul 17, 2014 at 6:23 PM, Mike Heffner m...@librato.com wrote: What is the proper way to perform a column slice using CQL with 1.2? I have a CF with a primary key X and 3 composite columns (A, B, C). I'd like to find records at: key=X columns >= (A=1, B=3, C=4) AND columns <= (A=2) The Query: SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND column3>4 AND column1<=2; fails with: DoGetMeasures: column1 cannot be restricted by both an equal and an inequal relation This is against Cassandra 1.2.16. What is the proper way to perform this query? Cheers, Mike -- Mike Heffner m...@librato.com Librato, Inc. -- - michael dykman - mdyk...@gmail.com May the Source be with you. -- Mike Heffner m...@librato.com Librato, Inc.
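The slice being asked for here is lexicographic over the clustering columns, which is exactly how Python compares tuples. A small illustration of the intended predicate — rows strictly after (1, 3, 4) whose first component is at most 2; the bounds are taken from the thread's example and the exact inclusivity is an assumption:

```python
# Composite-column order is lexicographic, like Python tuple comparison.
rows = [(1, 3, 4), (1, 3, 5), (1, 9, 0), (2, 0, 0), (3, 0, 0)]

selected = [r for r in rows if r > (1, 3, 4) and r[0] <= 2]
# (1, 3, 4) itself is excluded (not strictly greater),
# and (3, 0, 0) lies past the upper bound on the first component.
```

This is precisely what the tuple notation from CASSANDRA-4851 expresses server-side, and why flattening the bounds into per-column equality/range restrictions cannot express the same slice.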
How to restart bootstrap after a failed streaming due to Broken Pipe (1.2.16)
Hi, During an attempt to bootstrap a new node into a 1.2.16 ring the new node saw one of the streaming nodes periodically disappear:

INFO [GossipTasks:1] 2014-06-10 00:28:52,572 Gossiper.java (line 823) InetAddress /10.156.1.2 is now DOWN
ERROR [GossipTasks:1] 2014-06-10 00:28:52,574 AbstractStreamSession.java (line 108) Stream failed because /10.156.1.2 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
WARN [GossipTasks:1] 2014-06-10 00:28:52,574 RangeStreamer.java (line 246) Streaming from /10.156.1.2 failed
INFO [HANDSHAKE-/10.156.1.2] 2014-06-10 00:28:57,922 OutboundTcpConnection.java (line 418) Handshaking version with /10.156.1.2
INFO [GossipStage:1] 2014-06-10 00:28:57,943 Gossiper.java (line 809) InetAddress /10.156.1.2 is now UP

This brief interruption was enough to kill the streaming from node 10.156.1.2. Node 10.156.1.2 saw a similar broken pipe exception from the bootstrapping node:

ERROR [Streaming to /10.156.193.1.3] 2014-06-10 01:22:02,345 CassandraDaemon.java (line 191) Exception in thread Thread[Streaming to /10.156.1.3:1,5,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
    at com.google.common.base.Throwables.propagate(Throwables.java:160)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Broken pipe
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:420)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:552)
    at org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:93)
    at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

During bootstrapping we notice a significant spike in CPU and latency across the board on the ring (CPU 50-85% and write latencies 60ms - 250ms). It seems likely that this persistent high load led to the hiccup that caused the gossiper to see the streaming node as briefly down. What is the proper way to recover from this? The original estimate was almost 24 hours to stream all the data required to bootstrap this single node (streaming set to unlimited) and this occurred 6 hours into the bootstrap. With such high load from streaming it seems that simply restarting will inevitably hit this problem again. Cheers, Mike -- Mike Heffner m...@librato.com Librato, Inc.
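Two cassandra.yaml knobs are commonly adjusted in this situation; the values below are illustrative only, not a recommendation. Raising phi_convict_threshold makes the failure detector slower to declare a briefly-stalled peer down, and a non-zero streaming_socket_timeout_in_ms lets a genuinely stalled stream socket fail promptly instead of hanging:

```yaml
# cassandra.yaml (excerpt) -- illustrative values only.
# Higher phi makes gossip more tolerant of brief pauses under load
# (default is 8; cloud deployments often raise it to 10-12).
phi_convict_threshold: 12
# 0 (the 1.2 default) means a streaming connection never times out.
streaming_socket_timeout_in_ms: 600000
```

Neither setting changes how much data must be re-streamed after a failure, but the first makes the "node briefly DOWN" trigger less likely in the first place.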
Re: Failed decommission
Janne, We ran into this too. Appears it's a bug in 1.2.8 that is fixed in the upcoming 1.2.9. I added the steps I took to finally remove the node here: https://issues.apache.org/jira/browse/CASSANDRA-5857?focusedCommentId=13748998&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13748998 Cheers, Mike On Sun, Aug 25, 2013 at 4:06 AM, Janne Jalkanen janne.jalka...@ecyrd.com wrote: This on cass 1.2.8 Ring state before decommission:

--  Address   Load      Owns   Host ID                               Token                                    Rack
UN  10.0.0.1  38.82 GB  33.3%  21a98502-dc74-4ad0-9689-0880aa110409  1                                        1a
UN  10.0.0.2  33.5 GB   33.3%  cba6b27a-4982-4f04-854d-cc73155d5f69  56713727820156407428984779325531226110   1b
UN  10.0.0.3  37.41 GB  0.0%   6ba2c7d4-713e-4c14-8df8-f861fb211b0d  56713727820156407428984779325531226111   1b
UN  10.0.0.4  35.7 GB   33.3%  bf3d4792-f3e0-4062-afe3-be292bc85ed7  11342745564031281485796955865106245      1c

Trying to decommission the node:

ubuntu@10.0.0.3:~$ nodetool decommission
Exception in thread "main" java.lang.NumberFormatException: For input string: "56713727820156407428984779325531226111"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:444)
    at java.lang.Long.parseLong(Long.java:483)
    at org.apache.cassandra.service.StorageService.extractExpireTime(StorageService.java:1660)
    at org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1515)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1234)
    at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:949)
    at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1116)
    at org.apache.cassandra.service.StorageService.leaveRing(StorageService.java:2817)
    at org.apache.cassandra.service.StorageService.unbootstrap(StorageService.java:2861)
    at org.apache.cassandra.service.StorageService.decommission(StorageService.java:2808)

Now I'm in a state where the machine is still up but leaving but I can't seem to get it out of the ring. For example:

% nodetool removenode 6ba2c7d4-713e-4c14-8df8-f861fb211b0d
Exception in thread "main" java.lang.UnsupportedOperationException: Node /10.0.0.3 is alive and owns this ID. Use decommission command to remove it from the ring

Any ideas? /Janne -- Mike Heffner m...@librato.com Librato, Inc.
Re: Decommission faster than bootstrap
We've also noticed fairly poor streaming performance during a bootstrap operation, albeit with 1.2.x. Streaming takes much longer than the physical hardware capacity would suggest, even with the limits set high or off: https://issues.apache.org/jira/browse/CASSANDRA-5726 On Sun, Aug 18, 2013 at 6:19 PM, Rodrigo Felix rodrigofelixdealme...@gmail.com wrote: Hi, I've noticed that, at least in my environment (Cassandra 1.1.12 running on Amazon EC2), decommission operations take about 3-4 minutes while bootstrap can take more than 20 minutes. What is the reason for this time difference? For both operations, what is time-consuming is the data streaming from (or to) other nodes, right? Thanks in advance. Att. *Rodrigo Felix de Almeida* LSBD - Universidade Federal do Ceará Project Manager MBA, CSM, CSPO, SCJP -- Mike Heffner m...@librato.com Librato, Inc.
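The "limits" referred to above are the streaming throttle in cassandra.yaml; per the thread, raising or disabling it did not close the gap, which is what makes the JIRA interesting. For reference (200 is the shipped default in the 1.2 line):

```yaml
# cassandra.yaml (excerpt) -- outbound streaming cap, in megabits/s.
# Default is 200; setting 0 disables throttling entirely (equivalent
# to `nodetool setstreamthroughput 0` at runtime).
stream_throughput_outbound_megabits_per_sec: 0
```

When bootstrap remains slow with the throttle off, the bottleneck is elsewhere (per-file stream sessions, compaction of incoming sstables, or vnode fan-out), not this setting.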
Re: High performance hardware with lot of data per node - Global learning about configuration
seems coherent ? Right now, performance is correct, latency < 5 ms almost all the time. What can I do to handle more data per node and keep this performance, or even improve it? I know this is a long message but if you have any comment or insight even on part of it, don't hesitate to share it. I guess this kind of comment on configuration is usable by the entire community. Alain -- Mike Heffner m...@librato.com Librato, Inc.
Re: High performance hardware with lot of data per node - Global learning about configuration
Aiman, I believe that is one of the cases we added a check for: https://github.com/librato/tablesnap/blob/master/tablesnap#L203-L207 Mike On Thu, Jul 11, 2013 at 1:54 PM, Aiman Parvaiz ai...@grapheffect.com wrote: Thanks for the info Mike, we ran into a race condition which was killing tablesnap. I want to share the problem and the solution/workaround, and maybe someone can throw some light on the effects of the solution. tablesnap was getting killed with this error message: "Failed uploading %s. Aborting.\n%s" Looking at the code it took me to the following:

def worker(self):
    bucket = self.get_bucket()

    while True:
        f = self.fileq.get()
        keyname = self.build_keyname(f)
        try:
            self.upload_sstable(bucket, keyname, f)
        except:
            self.log.critical("Failed uploading %s. Aborting.\n%s" %
                              (f, format_exc()))
            # Brute force kill self
            os.kill(os.getpid(), signal.SIGKILL)
        self.fileq.task_done()

It builds the filename and then, before it can upload it, the file disappears (which is possible). I simply commented out the line which kills tablesnap if the file is not found; it fixes the issue we were having, but I would appreciate it if someone has any insights on any ill effects this might have on the backup or restoration process. Thanks On Jul 11, 2013, at 7:03 AM, Mike Heffner m...@librato.com wrote: We've also noticed very good read and write latencies with the hi1.4xls compared to our previous instance classes. We actually ran a mixed cluster of hi1.4xls and m2.4xls for a side-by-side comparison. Despite the significant improvement in underlying hardware, we've noticed that streaming performance with 1.2.6+vnodes is a lot slower than we would expect. Bootstrapping a node into a ring with large storage loads can take 6+ hours. We have a JIRA open that describes our current config: https://issues.apache.org/jira/browse/CASSANDRA-5726 Aiman: We also use tablesnap for our backups. We're using a slightly modified version [1]. 
We currently back up every sstable as soon as it hits disk (tablesnap's inotify), but we're considering moving to a periodic snapshot approach as the sstable churn after going from 24 nodes to 6 nodes is quite high. Mike [1]: https://github.com/librato/tablesnap On Thu, Jul 11, 2013 at 7:33 AM, Aiman Parvaiz <ai...@grapheffect.com> wrote: Hi, We also recently migrated to 3 hi1.4xlarge boxes (RAID-0 SSD) and the disk I/O performance is definitely better than the earlier non-SSD servers; we are serving up to 14k reads/s with a latency of 3-3.5 ms/op. I wanted to share our config options and ask about the data backup strategy for RAID-0. We are using C* 1.2.6 with key_cache and row_cache of 300MB. I have not changed/modified any other parameter except for going with multithreaded GC. I will be playing around with other factors and update everyone if I find something interesting. Also, I just wanted to share our backup strategy and see if I can get something useful from how others are backing up their RAID-0. I am using tablesnap to upload SSTables to S3, and I have attached a separate EBS volume to every box and set up rsync to mirror Cassandra data from RAID-0 to EBS. I would really appreciate it if you guys can share how you're taking backups. Thanks On Jul 9, 2013, at 7:11 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote: Hi, Using C* 1.2.2. We recently dropped our 18 m1.xlarge (4 CPU, 15GB RAM, 4 RAID-0 disks) servers to get 3 hi1.4xlarge (16 CPU, 60GB RAM, 2 RAID-0 SSD) servers instead, for about the same price. We tried it after reading a benchmark published by Netflix. It is awesome and I recommend it to anyone who is using more than 18 xlarge servers or can afford these high-cost / high-performance EC2 instances. SSD gives very good throughput with awesome latency. Yet, we had about 200 GB of data per server before and now about 1 TB. To alleviate memory pressure inside the heap I had to reduce the index sampling. 
I changed the index_interval value from 128 to 512, with no visible impact on latency, but a great improvement inside the heap, which doesn't complain about any pressure anymore. Is there more tuning I could use, more tricks that could be useful with big servers, a lot of data per node, and relatively high throughput? SSDs are at 20-40% of their throughput capacity (according to OpsCenter), CPU almost never reaches a load higher than 5 or 6 (with 16 CPUs), and 15 GB of RAM is used out of 60GB. At this point I have kept my previous configuration, which is almost the default one from the DataStax community AMI. Here is part of it; you can consider that any property not listed here is configured as default: cassandra.yaml key_cache_size_in_mb: (empty) - so default - 100MB (hit rate between 88 % and 92 %, good enough ?)
Re: High performance hardware with a lot of data per node - Global learning about configuration
I'm curious because we are experimenting with a very similar configuration: what basis did you use for expanding the index_interval to that value? Do you have before-and-after numbers, or was it simply the reduction of heap pressure warnings that you looked for? thanks, Mike On Tue, Jul 9, 2013 at 10:11 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote: Hi, Using C* 1.2.2. We recently dropped our 18 m1.xlarge (4 CPU, 15GB RAM, 4 RAID-0 disks) servers to get 3 hi1.4xlarge (16 CPU, 60GB RAM, 2 RAID-0 SSD) servers instead, for about the same price. We tried it after reading a benchmark published by Netflix. It is awesome and I recommend it to anyone who is using more than 18 xlarge servers or can afford these high-cost / high-performance EC2 instances. SSD gives very good throughput with awesome latency. Yet, we had about 200 GB of data per server before and now about 1 TB. To alleviate memory pressure inside the heap I had to reduce the index sampling. I changed the index_interval value from 128 to 512, with no visible impact on latency, but a great improvement inside the heap, which doesn't complain about any pressure anymore. Is there more tuning I could use, more tricks that could be useful with big servers, a lot of data per node, and relatively high throughput? SSDs are at 20-40% of their throughput capacity (according to OpsCenter), CPU almost never reaches a load higher than 5 or 6 (with 16 CPUs), and 15 GB of RAM is used out of 60GB. At this point I have kept my previous configuration, which is almost the default one from the DataStax community AMI. Here is part of it; you can consider that any property not listed here is configured as default: cassandra.yaml key_cache_size_in_mb: (empty) - so default - 100MB (hit rate between 88 % and 92 %, good enough ?) 
row_cache_size_in_mb: 0 (not usable in our use case: a lot of different and random reads)
flush_largest_memtables_at: 0.80
reduce_cache_sizes_at: 0.90
concurrent_reads: 32 (I am thinking of increasing this to 64 or more, since I have just a few servers handling more concurrency)
concurrent_writes: 32 (I am thinking of increasing this to 64 or more too)
memtable_total_space_in_mb: 1024 (to avoid having a full heap; should I use a bigger value, and why?)
rpc_server_type: sync (I tried hsha and got the "ERROR 12:02:18,971 Read an invalid frame size of 0. Are you using TFramedTransport on the client side?" error. No idea how to fix this, and I use 5 different clients for different purposes: Hector, Cassie, phpCassa, Astyanax, Helenus...)
multithreaded_compaction: false (Should I try enabling this since I now use SSD?)
compaction_throughput_mb_per_sec: 16 (I will definitely up this to 32 or even more)
cross_node_timeout: true
endpoint_snitch: Ec2MultiRegionSnitch
index_interval: 512

cassandra-env.sh - I am not sure how to tune the heap, so I mainly use the defaults:

MAX_HEAP_SIZE=8G
HEAP_NEWSIZE=400M (I tried higher values, and they produced bigger GC times: 1600 ms instead of the 200 ms I see now with 400M)
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly

Does this configuration seem coherent? Right now performance is correct, with latency around 5ms almost all the time. What can I do to handle more data per node and keep this performance, or get even better? I know this is a long message, but if you have any comment or insight, even on part of it, don't hesitate to share it. I guess this kind of comment on configuration is usable by the entire community. Alain -- Mike Heffner m...@librato.com Librato, Inc.
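On the index_interval question: raising it shrinks the in-heap index sample roughly linearly, which is why going 128 -> 512 relieved the heap pressure. A back-of-envelope sketch — the row count below is hypothetical, for illustration only:

```python
def index_sample_entries(row_keys, index_interval):
    """Cassandra keeps roughly one in-heap index sample for every
    index_interval row keys, so quadrupling the interval cuts the
    sample count (and its heap footprint) to about a quarter."""
    return row_keys // index_interval

# Hypothetical node holding 500 million row keys:
rows = 500_000_000
assert index_sample_entries(rows, 128) == 3_906_250
assert index_sample_entries(rows, 512) == 976_562  # ~4x fewer samples
```

The trade-off is read-path work: a wider interval means scanning more index entries per lookup on average, which matches the reported "no visible impact on latency" only as long as the extra scan stays cheap relative to disk I/O.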
Re: Streaming performance with 1.2.6
On Mon, Jul 1, 2013 at 10:06 PM, Mike Heffner <m...@librato.com> wrote: The only changes we've made to the config (aside from dirs/hosts) are: Forgot to include we've changed this as well:

-partitioner: org.apache.cassandra.dht.Murmur3Partitioner
+partitioner: org.apache.cassandra.dht.RandomPartitioner

Cheers, Mike -- Mike Heffner m...@librato.com Librato, Inc.
Re: Streaming performance with 1.2.6
Sankalp, Parallel sstableloader streaming would definitely be valuable. However, this ring is currently using vnodes, and I was surprised to see that a bootstrapping node only streamed from one node in the ring. My understanding was that a bootstrapping node would stream from multiple nodes in the ring. We started with a 3-node / 3-AZ, RF=3 ring. We then increased that to 6 nodes, adding one per AZ. The 4th, 5th, and 6th nodes only streamed from the node in their own AZ/rack, which led to the serial sstable streaming. Is this the correct behavior for the snitch? Is there an option to stream from multiple replicas across the AZ/rack configuration? Mike On Tue, Jul 2, 2013 at 1:53 PM, sankalp kohli <kohlisank...@gmail.com> wrote: This was a problem pre-vnodes. I had several JIRAs for that, but some of them were voted down on the grounds that performance would improve with vnodes. The main problem is that it streams one sstable at a time and not in parallel. JIRA 4784 can speed up the bootstrap performance. You can also do a zero copy and not touch the caches of the nodes which are contributing to the build. https://issues.apache.org/jira/browse/CASSANDRA-4663 https://issues.apache.org/jira/browse/CASSANDRA-4784 On Tue, Jul 2, 2013 at 7:35 AM, Mike Heffner <m...@librato.com> wrote: On Mon, Jul 1, 2013 at 10:06 PM, Mike Heffner <m...@librato.com> wrote: The only changes we've made to the config (aside from dirs/hosts) are: Forgot to include we've changed this as well: -partitioner: org.apache.cassandra.dht.Murmur3Partitioner +partitioner: org.apache.cassandra.dht.RandomPartitioner Cheers, Mike -- Mike Heffner m...@librato.com Librato, Inc. -- Mike Heffner m...@librato.com Librato, Inc.
Re: Streaming performance with 1.2.6
As a test, we added a 7th node in the first AZ, and it streamed from both of the two existing nodes in the same AZ. Aggregate streaming bandwidth at the 7th node was approximately 12 MB/sec when all limits were set at 800 MB/sec, or about double what I saw streaming from a single node. This would seem to indicate that the sending node is limiting our streaming rate. Mike On Tue, Jul 2, 2013 at 3:00 PM, Mike Heffner <m...@librato.com> wrote: Sankalp, Parallel sstableloader streaming would definitely be valuable. However, this ring is currently using vnodes, and I was surprised to see that a bootstrapping node only streamed from one node in the ring. My understanding was that a bootstrapping node would stream from multiple nodes in the ring. We started with a 3-node / 3-AZ, RF=3 ring. We then increased that to 6 nodes, adding one per AZ. The 4th, 5th, and 6th nodes only streamed from the node in their own AZ/rack, which led to the serial sstable streaming. Is this the correct behavior for the snitch? Is there an option to stream from multiple replicas across the AZ/rack configuration? Mike On Tue, Jul 2, 2013 at 1:53 PM, sankalp kohli <kohlisank...@gmail.com> wrote: This was a problem pre-vnodes. I had several JIRAs for that, but some of them were voted down on the grounds that performance would improve with vnodes. The main problem is that it streams one sstable at a time and not in parallel. JIRA 4784 can speed up the bootstrap performance. You can also do a zero copy and not touch the caches of the nodes which are contributing to the build. 
https://issues.apache.org/jira/browse/CASSANDRA-4663 https://issues.apache.org/jira/browse/CASSANDRA-4784 On Tue, Jul 2, 2013 at 7:35 AM, Mike Heffner m...@librato.com wrote: On Mon, Jul 1, 2013 at 10:06 PM, Mike Heffner m...@librato.com wrote: The only changes we've made to the config (aside from dirs/hosts) are: Forgot to include we've changed this as well: -partitioner: org.apache.cassandra.dht.Murmur3Partitioner +partitioner: org.apache.cassandra.dht.RandomPartitioner Cheers, Mike -- Mike Heffner m...@librato.com Librato, Inc. -- Mike Heffner m...@librato.com Librato, Inc. -- Mike Heffner m...@librato.com Librato, Inc.
Streaming performance with 1.2.6
Hi, We've recently been testing some of the higher-performance instance classes on EC2, specifically the hi1.4xlarge, with Cassandra. For those not familiar with them, they have two SSD disks and 10 GigE. While we have observed much improved raw performance over our current instances, we are seeing a fairly large gap between Cassandra and raw performance. We have particularly noticed a gap in streaming performance when bootstrapping a new node, and I wanted to ensure that we have configured these instances correctly to get the best performance out of Cassandra. When bootstrapping a new node into a small ring with a 35GB streaming payload, we see a 5-8 MB/sec max streaming rate joining the new node to the ring. We are using 1.2.6 with 256-token vnode support. In our tests the ring is small enough that all streaming occurs from a single node. To test hardware performance for this use case, we ran an rsync of the sstables from one node to the next (to/from the same file systems) and observed a consistent rate of 115 MB/sec. The only changes we've made to the config (aside from dirs/hosts) are:

-concurrent_reads: 32
-concurrent_writes: 32
+concurrent_reads: 128 # 32
+concurrent_writes: 128 # 32
-rpc_server_type: sync
+rpc_server_type: hsha # sync
-compaction_throughput_mb_per_sec: 16
+compaction_throughput_mb_per_sec: 256 # 16
-read_request_timeout_in_ms: 1
+read_request_timeout_in_ms: 6000 # 1
-endpoint_snitch: SimpleSnitch
+endpoint_snitch: Ec2Snitch # SimpleSnitch
-internode_compression: all
+internode_compression: none

We use a 10G heap with a 2G new size. We are using the Oracle 1.7.0_25 JVM. I've adjusted our streaming throughput limit from 200 MB/sec up to 800 MB/sec on both the sending and receiving streaming nodes, but that doesn't appear to make a difference. The disks are RAID-0 (2 * 1T SSD) with 512 read-ahead, XFS. The nodes in the ring are running about 23% CPU on average, with spikes up to a maximum of 45% CPU. 
As I mentioned, on the same boxes with the same workloads, I've seen up to 115 MB/sec transfers with rsync. Any suggestions for what to adjust to see better streaming performance? 5% of what a single rsync can do seems somewhat limited. Thanks, Mike -- Mike Heffner m...@librato.com Librato, Inc.
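One detail worth double-checking when tuning the streaming limit discussed above: in the 1.2.x cassandra.yaml the throttle is expressed in megabits, not megabytes, which is easy to misread. A sketch of the stock setting (assuming the standard 1.2.x config file):

```yaml
# cassandra.yaml (1.2.x): throttle for outbound streaming between nodes.
# Note the unit is megabits per second; the default 200 Mbit/s is only
# about 25 MB/s of actual transfer.
stream_throughput_outbound_megabits_per_sec: 800
```

The same limit can be raised at runtime with `nodetool setstreamthroughput 800`, though neither would explain a 5-8 MB/sec ceiling on its own.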
Re: Upgrade 1.1.2 -> 1.1.6
Alain, My understanding is that drain ensures that all memtables are flushed, so that there is no data in the commitlog that isn't in an sstable. A marker is saved that indicates the commit logs should not be replayed. Commitlogs are only removed from disk periodically (after commitlog_total_space_in_mb is exceeded?). With 1.1.5/6, all nanotime commitlogs are replayed on startup regardless of whether they've been flushed. So in our case, manually removing all the commitlogs after a drain was the only way to prevent their replay. Mike On Tue, Nov 20, 2012 at 5:19 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote: @Mike I am glad to see I am not the only one with this issue (even if I am sorry it happened to you, of course). Isn't drain supposed to clear the commit logs? Did removing them work properly? In his warning to C* users, Jonathan Ellis said that a drain would avoid this issue; it seems like it doesn't. @Rob You understood precisely the 2 issues I met during the upgrade. I am sad to see that neither of them is yet resolved and probably won't be. 2012/11/20 Mike Heffner <m...@librato.com> Alain, We performed a 1.1.3 -> 1.1.6 upgrade and found that all the logs replayed regardless of the drain. After noticing this on the first node, we did the following:

* nodetool flush
* nodetool drain
* service cassandra stop
* mv /path/to/logs/*.log /backup/
* apt-get install cassandra (restarts automatically)

I also agree that starting C* after an upgrade/install seems quite broken if it was already stopped before the install. However annoying, I have found this to be the default for most Ubuntu daemon packages. Mike On Thu, Nov 15, 2012 at 9:21 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote: We had an issue with counters over-counting even when using the nodetool drain command before upgrading... 
Here is my bash history:

69  cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
70  cp /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh.bak
71  sudo apt-get install cassandra
72  nodetool disablethrift
73  nodetool drain
74  service cassandra stop
75  cat /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh.bak
76  vim /etc/cassandra/cassandra-env.sh
77  cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
78  vim /etc/cassandra/cassandra.yaml
79  service cassandra start

So I think I followed these steps: http://www.datastax.com/docs/1.1/install/upgrading#upgrade-steps I merged my conf files with an external tool, so consider that I merged my conf files on steps 76 and 78. I saw that `sudo apt-get install cassandra` stops the server and restarts it automatically. So it upgraded without draining and restarted before I had time to reconfigure the conf files. Is this normal? Is there a way to avoid it? So for the second node I decided to try to stop C* before the upgrade:

125 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
126 cp /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh.bak
127 nodetool disablegossip
128 nodetool disablethrift
129 nodetool drain
130 service cassandra stop
131 sudo apt-get install cassandra  // 131: This restarted cassandra
132 nodetool disablethrift
133 nodetool disablegossip
134 nodetool drain
135 service cassandra stop
136 cat /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh.bak
137 cim /etc/cassandra/cassandra-env.sh
138 vim /etc/cassandra/cassandra-env.sh
139 cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
140 vim /etc/cassandra/cassandra.yaml
141 service cassandra start

After both of these updates I saw my counters increase without any reason. Did I do anything wrong? Alain -- Mike Heffner m...@librato.com Librato, Inc. -- Mike Heffner m...@librato.com Librato, Inc.
Re: Upgrade 1.1.2 -> 1.1.6
On Tue, Nov 20, 2012 at 2:49 PM, Rob Coli <rc...@palominodb.com> wrote: On Mon, Nov 19, 2012 at 7:18 PM, Mike Heffner <m...@librato.com> wrote: We performed a 1.1.3 -> 1.1.6 upgrade and found that all the logs replayed regardless of the drain. Your experience and desire for different (expected) behavior is welcomed on: https://issues.apache.org/jira/browse/CASSANDRA-4446 ("nodetool drain sometimes doesn't mark commitlog fully flushed") If every production operator who experiences this issue shares their experience on this bug, perhaps the project will acknowledge and address it. Well, in this case I think our issue was that upgrading from nanotime to epoch-seconds timestamps, by definition, replays all commit logs. That's not due to any specific problem with nodetool drain not marking commitlogs flushed, but a safety measure to ensure data is not lost due to buggy nanotime implementations. For us, the problem was that the pre-1.1.5/1.1.6 upgrade instructions didn't mention that commitlogs should be removed after a successful drain. On the other hand, we do not use counters, so replaying them merely meant a much longer MTT-Return after restarting with 1.1.6. Mike -- Mike Heffner m...@librato.com Librato, Inc.
Re: Upgrade 1.1.2 -> 1.1.6
Alain, We performed a 1.1.3 -> 1.1.6 upgrade and found that all the logs replayed regardless of the drain. After noticing this on the first node, we did the following:

* nodetool flush
* nodetool drain
* service cassandra stop
* mv /path/to/logs/*.log /backup/
* apt-get install cassandra (restarts automatically)

I also agree that starting C* after an upgrade/install seems quite broken if it was already stopped before the install. However annoying, I have found this to be the default for most Ubuntu daemon packages. Mike On Thu, Nov 15, 2012 at 9:21 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote: We had an issue with counters over-counting even when using the nodetool drain command before upgrading... Here is my bash history:

69  cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
70  cp /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh.bak
71  sudo apt-get install cassandra
72  nodetool disablethrift
73  nodetool drain
74  service cassandra stop
75  cat /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh.bak
76  vim /etc/cassandra/cassandra-env.sh
77  cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
78  vim /etc/cassandra/cassandra.yaml
79  service cassandra start

So I think I followed these steps: http://www.datastax.com/docs/1.1/install/upgrading#upgrade-steps I merged my conf files with an external tool, so consider that I merged my conf files on steps 76 and 78. I saw that `sudo apt-get install cassandra` stops the server and restarts it automatically. So it upgraded without draining and restarted before I had time to reconfigure the conf files. Is this normal? Is there a way to avoid it? So for the second node I decided to try to stop C* before the upgrade. 
125 cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
126 cp /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh.bak
127 nodetool disablegossip
128 nodetool disablethrift
129 nodetool drain
130 service cassandra stop
131 sudo apt-get install cassandra  // 131: This restarted cassandra
132 nodetool disablethrift
133 nodetool disablegossip
134 nodetool drain
135 service cassandra stop
136 cat /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh.bak
137 cim /etc/cassandra/cassandra-env.sh
138 vim /etc/cassandra/cassandra-env.sh
139 cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
140 vim /etc/cassandra/cassandra.yaml
141 service cassandra start

After both of these updates I saw my counters increase without any reason. Did I do anything wrong? Alain -- Mike Heffner m...@librato.com Librato, Inc.
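The per-node sequence Mike describes above, rendered as a rough shell sketch. The paths are illustrative (the commitlog directory depends on your cassandra.yaml; /var/lib/cassandra/commitlog is a common default), so treat this as an outline, not a drop-in script:

```sh
# Flush memtables and stop accepting traffic, then shut down cleanly.
nodetool flush
nodetool drain
service cassandra stop

# 1.1.5/6 replays old (nanotime-stamped) commitlogs on startup even
# after a drain, so move them aside once the drain has completed.
mv /var/lib/cassandra/commitlog/*.log /backup/commitlog/

# Note: the Ubuntu package restarts Cassandra automatically on install.
apt-get install cassandra
```

Only remove commitlogs after a drain has completed without errors; otherwise the moved logs may contain data that never reached an sstable.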
Re: Hinted Handoff runs every ten minutes
Is there a ticket open for this for 1.1.6? We also noticed this after upgrading from 1.1.3 to 1.1.6. Every node runs a 0-row hinted handoff every 10 minutes: N-1 nodes hint to the same node, while that node hints to another node. On Tue, Oct 30, 2012 at 1:35 PM, Vegard Berget <p...@fantasista.no> wrote: Hi, I have the exact same problem with 1.1.6. HintsColumnFamily consists of one row (rowkey 00, nothing more). The problem started after upgrading from 1.1.4 to 1.1.6. Every ten minutes HintedHandoffManager starts and finishes after sending 0 rows. .vegard, - Original Message - From: user@cassandra.apache.org To: user@cassandra.apache.org Cc: Sent: Mon, 29 Oct 2012 23:45:30 +0100 Subject: Re: Hinted Handoff runs every ten minutes On 29.10.2012 23:24, Stephen Pierce wrote: I'm running 1.1.5; the bug says it's fixed in 1.0.9/1.1.0. How can I check to see why it keeps running HintedHandoff? You have a tombstone in system.HintsColumnFamily; use the list command in cassandra-cli to check. -- Mike Heffner m...@librato.com Librato, Inc.
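The check suggested in that last reply looks something like the following cassandra-cli session (a sketch; the exact prompt and output vary by version):

```
$ cassandra-cli -h localhost
[default@unknown] use system;
[default@system] list HintsColumnFamily;
```

A single near-empty row (such as the "rowkey 00" Vegard reports) left behind as a tombstone would explain the HintedHandoffManager waking up every ten minutes and sending 0 rows.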
Re: Migrating data from a 0.8.8 -> 1.1.2 ring
On Mon, Jul 23, 2012 at 1:25 PM, Mike Heffner <m...@librato.com> wrote: Hi, We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing missing data post-migration. We use pre-built/configured AMIs, so our preferred route is to leave our existing production 0.8.8 untouched and bring up a parallel 1.1.2 ring and migrate data into it. Data is written to the rings via batch processes, so we can easily assure that both the existing and new rings will have the same data post-migration. snip The steps we are taking are:

1. Bring up a 1.1.2 ring in the same AZ/data center configuration with tokens matching the corresponding nodes in the 0.8.8 ring.
2. Create the same keyspace on 1.1.2.
3. Create each CF in the keyspace on 1.1.2.
4. Flush each node of the 0.8.8 ring.
5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node in 1.1.2.
6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming the file to the /cassandra/data/keyspace/cf/keyspace-cf... format. For example, for the keyspace Metrics and CF epochs_60 we get: cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db.
7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics CF` for each CF in the keyspace. We notice that storage load jumps accordingly.
8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This takes a while but appears to correctly rewrite each sstable in the new 1.1.x format. Storage load drops as sstables are compressed.

So, after some further testing we've observed that the `upgradesstables` command is removing data from the sstables, leading to our missing data. We've repeated the steps above with several variations:

WORKS: refresh -> scrub
WORKS: refresh -> scrub -> major compaction
FAILS: refresh -> upgradesstables
FAILS: refresh -> scrub -> upgradesstables
FAILS: refresh -> scrub -> major compaction -> upgradesstables

So, we are able to migrate our test CFs from a 0.8.8 ring to a 1.1.2 ring when we use scrub. 
However, whenever we run an upgradesstables command the sstables shrink significantly and our tests show missing data:

INFO [CompactionExecutor:4] 2012-07-24 04:27:36,837 CompactionTask.java (line 109) Compacting [SSTableReader(path='/raid0/cassandra/data/Metrics/metrics_900/Metrics-metrics_900-hd-51-Data.db')]
INFO [CompactionExecutor:4] 2012-07-24 04:27:51,090 CompactionTask.java (line 221) Compacted to [/raid0/cassandra/data/Metrics/metrics_900/Metrics-metrics_900-hd-58-Data.db,]. 60,449,155 to 2,578,102 (~4% of original) bytes for 4,002 keys at 0.172562MB/s. Time: 14,248ms.

Is there a scenario where upgradesstables would remove data that a scrub command wouldn't? According to the documentation, it would appear that the scrub command is actually more destructive than upgradesstables in terms of removing data, and on 1.1.x upgradesstables is the documented upgrade command over a scrub. The keyspace is defined as:

Keyspace: Metrics:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Durable Writes: true
  Options: [us-east:3]

And the column family above is defined as:

ColumnFamily: metrics_900
  Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Default column value validator: org.apache.cassandra.db.marshal.BytesType
  Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
  GC grace seconds: 0
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.1
  DC Local Read repair chance: 0.0
  Replicate on write: true
  Caching: KEYS_ONLY
  Bloom Filter FP chance: default
  Built indexes: []
  Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
  Compression Options: sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

All rows have a TTL of 30 days, so it's possible that, along with the gc_grace=0, a small number would be removed during a compaction/scrub/upgradesstables 
step. However, the majority should still be kept as their TTL has not expired yet. We are still experimenting to see under what conditions this happens, but I thought I'd send out some more info in case there is something clearly wrong we're doing here. Thanks, Mike -- Mike Heffner m...@librato.com Librato, Inc.
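The TTL interaction described above can be sketched as a simple rule (an illustration of the general expiry rule, not Cassandra's actual implementation): a rewrite such as a compaction at time `now` can purge a column once its TTL has expired and `gc_grace_seconds` has also elapsed, which with gc_grace=0 means immediately on expiry.

```python
def purgeable_at_rewrite(write_time_s, ttl_s, gc_grace_s, now_s):
    """A column written at write_time_s with a TTL becomes eligible for
    removal during a compaction/scrub/upgradesstables rewrite once both
    the TTL and the GC grace period have elapsed (epoch seconds)."""
    expires_at = write_time_s + ttl_s
    return now_s >= expires_at + gc_grace_s

DAY = 86400

# 30-day TTL with gc_grace=0, as in the schema above: a column written
# 31 days ago is purgeable, one written 29 days ago must be kept.
now = 1_000 * DAY
assert purgeable_at_rewrite(now - 31 * DAY, 30 * DAY, 0, now)
assert not purgeable_at_rewrite(now - 29 * DAY, 30 * DAY, 0, now)
```

This is consistent with the thread's expectation: with 30-day TTLs, only the oldest sliver of data should disappear during a rewrite, so a 96% shrink points at something else.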
Migrating data from a 0.8.8 -> 1.1.2 ring
Hi, We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing missing data post-migration. We use pre-built/configured AMIs, so our preferred route is to leave our existing production 0.8.8 untouched and bring up a parallel 1.1.2 ring and migrate data into it. Data is written to the rings via batch processes, so we can easily assure that both the existing and new rings will have the same data post-migration. The ring we are migrating from is:

* 12 nodes
* single data-center, 3 AZs
* 0.8.8

The ring we are migrating to is the same except 1.1.2. The steps we are taking are:

1. Bring up a 1.1.2 ring in the same AZ/data center configuration with tokens matching the corresponding nodes in the 0.8.8 ring.
2. Create the same keyspace on 1.1.2.
3. Create each CF in the keyspace on 1.1.2.
4. Flush each node of the 0.8.8 ring.
5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node in 1.1.2.
6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming the file to the /cassandra/data/keyspace/cf/keyspace-cf... format. For example, for the keyspace Metrics and CF epochs_60 we get: cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db.
7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics CF` for each CF in the keyspace. We notice that storage load jumps accordingly.
8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This takes a while but appears to correctly rewrite each sstable in the new 1.1.x format. Storage load drops as sstables are compressed.

After these steps we run a script that validates data on the new ring. What we've noticed is that large portions of the data that was on the 0.8.8 ring are not available on the 1.1.2 ring. We've tried reading at both QUORUM and ONE, but the resulting data appears missing in both cases. 
We have fewer than 143 million row keys in the CFs we're testing, and none of the *-Filter.db files are over 10MB, so I don't believe this is our problem: https://issues.apache.org/jira/browse/CASSANDRA-3820 Anything else to test or verify? Are the steps above correct for this type of upgrade? Is this type of upgrade/migration supported? We have also tried running a repair across the cluster after step #8. While it took a few retries due to https://issues.apache.org/jira/browse/CASSANDRA-4456, we still had missing data afterwards. Any assistance would be appreciated. Thanks! Mike -- Mike Heffner m...@librato.com Librato, Inc.
Wildcard character for CF in access.properties?
Is there a wildcard for the COLUMNFAMILY field in `access.properties`? I'd like to split read-write and read-only access between my backend and frontend users, respectively; however, the full list of CFs is not known a priori. I'm using 0.7.4. Cheers, Mike -- Mike Heffner m...@librato.com Librato, Inc.