Thanks Nate. We do not have monitoring set up yet, but I should be able to get the deployment updated with a metrics reporter. I'll update the thread with my findings.
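For anyone following along, here is a rough sketch of what I have in mind, using the metrics-reporter-config support that Cassandra can load via `-Dcassandra.metricsReporterConfigFile=...` in cassandra-env.sh. The Graphite host, port, and prefix below are placeholders for our environment, and the exact yaml shape follows the metrics-reporter-config-sample.yaml that ships with Cassandra, so treat it as a starting point rather than a verified config:

```yaml
# Placeholder reporter config; scopes reporting to the CommitLog metrics
# Nate mentioned. Host/port/prefix are environment-specific placeholders.
graphite:
  - period: 60
    timeunit: 'SECONDS'
    prefix: 'cassandra-qa'
    hosts:
      - host: 'graphite.example.com'
        port: 2003
    predicate:
      color: 'white'
      useQualifiedName: true
      patterns:
        - '^org.apache.cassandra.metrics.CommitLog.+'
```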
On Tue, Sep 20, 2016 at 10:30 PM, Nate McCall <n...@thelastpickle.com> wrote:

> If you can get to them in the test env, you want to look in
> o.a.c.metrics.CommitLog for:
>
> - TotalCommitlogSize: if this hovers near commitlog_total_space_in_mb and
>   never goes down, you are thrashing on segment allocation
> - WaitingOnCommit: the time spent waiting on calls to sync; it will start
>   to climb very fast if you can't sync within the sync interval
> - WaitingOnSegmentAllocation: how long it took to allocate a new commit
>   log segment; if it is all over the place, you are IO bound
>
> Try turning all the commit log settings way down for low-IO test
> infrastructure like this. Maybe a total commit log size of 32 MB with
> 4 MB segments (or even lower, depending on test data volume) so they
> basically flush constantly and don't try to hold any tables open. Also
> lower concurrent_writes substantially while you are at it to add some
> write throttling.
>
> On Wed, Sep 21, 2016 at 2:14 PM, John Sanda <john.sa...@gmail.com> wrote:
>
>> I have seen in various threads on the list that 3.0.x is probably best
>> for prod. Just wondering, though, if there is anything in particular in
>> 3.7 to be wary of.
>>
>> I need to check with one of our QA engineers to get specifics on the
>> storage. Here is what I do know. We have a blade center running lots of
>> virtual machines for various testing. Some of those VMs are running
>> Cassandra and the Java web apps I previously mentioned via Docker
>> containers. The storage is shared. Beyond that I don't have any more
>> specific details at the moment. I can also tell you that the storage
>> can be quite slow.
>>
>> I have come across different threads that talk to one degree or another
>> about the flush queue getting full. I have been looking at the code in
>> ColumnFamilyStore.java. Is perDiskFlushExecutors the thread pool I
>> should be interested in? It uses an unbounded queue, so I am not really
>> sure what it means for it to get full. Is there anything I can check or
>> look for to see if writes are getting blocked?
>>
>> On Tue, Sep 20, 2016 at 8:41 PM, Jonathan Haddad <j...@jonhaddad.com>
>> wrote:
>>
>>> If you haven't yet deployed to prod, I strongly recommend *not* using
>>> 3.7.
>>>
>>> What network storage are you using? Outside of a handful of highly
>>> experienced experts using EBS in very specific ways, it usually ends
>>> in failure.
>>>
>>> On Tue, Sep 20, 2016 at 3:30 PM, John Sanda <john.sa...@gmail.com>
>>> wrote:
>>>
>>>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>>>> instance. Each app creates its own schema at startup. One of the
>>>> schema changes involves dropping a table. I am seeing frequent
>>>> client-side timeouts reported by the DataStax driver after the DROP
>>>> TABLE statement is executed. I don't see this behavior in all
>>>> environments. I do see it consistently in a QA environment in which
>>>> Cassandra is running in Docker with network storage, so writes are
>>>> pretty slow from the get-go. In my logs I see a lot of tables getting
>>>> flushed, which I guess are all of the dirty column families in the
>>>> respective commit log segment. Then I see a whole bunch of flushes
>>>> getting queued up. Can I reach a point at which so many table flushes
>>>> get queued that writes would be blocked?
>>>>
>>>> --
>>>> - John
>
> --
> -----------------
> Nate McCall
> Wellington, NZ
> @zznate
>
> CTO
> Apache Cassandra Consulting
> http://www.thelastpickle.com

--
- John
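For reference, Nate's suggested low-IO tuning would look roughly like the following cassandra.yaml fragment. The commit log numbers are the ones he gave; the concurrent_writes value is only an illustration of "lower it substantially" (the 3.x default is 32), so adjust to taste:

```yaml
# Low-IO test-environment settings per Nate's suggestion (illustrative).
# Defaults in 3.x are much larger (8192 MB total, 32 MB segments), so
# these force near-constant flushing instead of holding tables open.
commitlog_total_space_in_mb: 32
commitlog_segment_size_in_mb: 4

# Default is 32; lowering this adds back-pressure on writes.
concurrent_writes: 8
```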