As a follow-up, I'm going to write a simple patch to expose the number of bytes flushed from memtables to JMX, so that we can easily monitor it.
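Roughly, the idea is to hang a counter off the Codahale-based metrics plumbing that Cassandra already publishes over JMX. A minimal standalone sketch of the shape of it (the class and metric names here are hypothetical, not the actual patch; the real change would register the counter in Cassandra's existing per-table metric hierarchy):

    import com.codahale.metrics.Counter;
    import com.codahale.metrics.JmxReporter;
    import com.codahale.metrics.MetricRegistry;

    public class MemtableFlushMetricSketch
    {
        private static final MetricRegistry registry = new MetricRegistry();

        // Hypothetical metric name; the real patch would hang this off the
        // per-table metrics instead of a standalone registry.
        private static final Counter bytesFlushed =
                registry.counter(MetricRegistry.name("Memtable", "BytesFlushed"));

        public static void main(String[] args) throws InterruptedException
        {
            // Publish the registry over JMX (metrics-core 3.x API) so that
            // jconsole or any other JMX client can read it.
            JmxReporter.forRegistry(registry)
                       .inDomain("org.apache.cassandra.metrics")
                       .build()
                       .start();

            // Wherever a memtable flush completes, record the flushed size:
            bytesFlushed.inc(42L * 1024 * 1024); // e.g. a 42MB flush
            Thread.sleep(60_000); // keep the JVM up long enough to inspect
        }
    }
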
Here is the jira: https://issues.apache.org/jira/browse/CASSANDRA-11420

On Thu, Mar 10, 2016 at 12:55 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> The doc does say this:
>
> "A log-structured engine that avoids overwrites and uses sequential IO to update data is essential for writing to solid-state disks (SSD) and hard disks (HDD). On HDD, writing randomly involves a higher number of seek operations than sequential writing. The seek penalty incurred can be substantial. Using sequential IO (thereby avoiding write amplification <http://en.wikipedia.org/wiki/Write_amplification> and disk failure), Cassandra accommodates inexpensive, consumer SSDs extremely well."
>
> I presume that write amplification argues for placing the commit log on a separate SSD device. That should probably be mentioned.
>
> -- Jack Krupansky
>
> On Thu, Mar 10, 2016 at 12:52 PM, Matt Kennedy <matt.kenn...@datastax.com> wrote:
>
>> It isn't really the data written by the host that you're concerned with, it's the data written by your application. I'd start by instrumenting your application tier to tally up the size of the values that it writes to C*.
>>
>> However, this value may not be extremely useful on its own; you can't do much with the information it provides. It is probably a better idea to track the bytes written to flash for each drive, so that you know the physical endurance of that type of drive given your workload. Unfortunately, the rated TBW endurance for the drive may not be very meaningful either, given the difference between the synthetic workload used to create those ratings and the workload that Cassandra produces in your particular case. You can find out more about those ratings here: https://www.jedec.org/standards-documents/docs/jesd219a
>>
>> Matt Kennedy
>>
>> Sr. Product Manager, DSE Core
>>
>> matt.kenn...@datastax.com | Public Calendar <http://goo.gl/4Ui04Z>
>>
>> *DataStax Enterprise - the database for cloud applications.*
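(To make the application-tier tallying above concrete: one low-tech way is a process-wide counter that the application bumps with the serialized size of every value just before writing it. A sketch, where writeToCassandra stands in for whatever client call the application actually makes:)

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.atomic.AtomicLong;

    public class WriteTally
    {
        // Process-wide tally of the bytes this application has handed to C*.
        static final AtomicLong bytesWritten = new AtomicLong();

        static void tallyingWrite(String key, String value)
        {
            byte[] k = key.getBytes(StandardCharsets.UTF_8);
            byte[] v = value.getBytes(StandardCharsets.UTF_8);
            bytesWritten.addAndGet(k.length + v.length);
            // writeToCassandra(k, v); // hypothetical: the app's real write path
        }

        public static void main(String[] args)
        {
            tallyingWrite("user:42", "some payload");
            System.out.println("logical bytes written: " + bytesWritten.get());
        }
    }
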
>> On Thu, Mar 10, 2016 at 11:44 AM, Dikang Gu <dikan...@gmail.com> wrote:
>>
>>> Hi Matt,
>>>
>>> Thanks for the detailed explanation! Yes, this is exactly what I'm looking for: "write amplification = data written to flash / data written by the host".
>>>
>>> We are heavily using LCS in production, so I'd like to figure out the amplification caused by that and see what we can do to optimize it. I have the metrics for "data written to flash", and I'm wondering whether there is an easy way to get the "data written by the host" on each C* node?
>>>
>>> Thanks
>>>
>>> On Thu, Mar 10, 2016 at 8:48 AM, Matt Kennedy <mkenn...@datastax.com> wrote:
>>>
>>>> TL;DR - Cassandra actually causes a ton of write amplification, but it doesn't freaking matter any more. Read on for details...
>>>>
>>>> That slide deck does have a lot of very good information on it, but unfortunately I think it has led to a fundamental misunderstanding about Cassandra and write amplification. In particular, slide 51 vastly oversimplifies the situation.
>>>>
>>>> The wikipedia definition of write amplification looks at this from the perspective of the SSD controller: https://en.wikipedia.org/wiki/Write_amplification#Calculating_the_value
>>>>
>>>> In short, write amplification = data written to flash / data written by the host.
>>>>
>>>> So, if I write 1MB in my application, but the SSD has to write my 1MB plus rearrange another 1MB of data in order to make room for it, then I've written a total of 2MB and my write amplification is 2x.
>>>>
>>>> In other words, it is measuring how much extra the SSD controller has to write in order to do its own housekeeping.
>>>>
>>>> However, the wikipedia definition is a bit more constrained than how the term is used in the storage industry. The whole point of looking at write amplification is to understand the impact that a particular workload is going to have on the underlying NAND by virtue of the data written. So a definition of write amplification that is a little more relevant to the context of Cassandra is this:
>>>>
>>>> write amplification = data written to flash / data written to the database
>>>>
>>>> So, while the fact that we only sequentially write large immutable SSTables does in fact mean that controller-level write amplification is near zero, compaction comes along and completely destroys that tidy little story. Think about it: every time a compaction re-writes data that has already been written, we are creating a lot of application-level write amplification. Different compaction strategies and the workload itself affect what the real application-level write amp is, but generally speaking, LCS is the worst, followed by STCS, and DTCS will cause the least write amp. To measure this, you can usually use smartctl (there may be another mechanism depending on the SSD manufacturer) to get the physical bytes written to your SSDs, and divide that by the data that you've actually logically written to Cassandra. I've measured (more than two years ago) LCS write amp as high as 50x on some workloads, which is significantly higher than the typical controller-level write amp on a b-tree style update-in-place data store. Also note that the new storage engine generally removes a lot of storage inefficiency, thereby reducing the impact of write amp due to compaction.
>>>>
>>>> However, if you're a person that understands SSDs, at this point you're wondering why we aren't burning out SSDs right and left. The reality is that general SSD endurance has gotten so good that all this write amp isn't really a problem any more. If you're curious to read more about that, I recommend you start here:
>>>>
>>>> http://hothardware.com/news/google-data-center-ssd-research-report-offers-surprising-results-slc-not-more-reliable-than-mlc-flash
>>>>
>>>> and the paper that article mentions:
>>>>
>>>> http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/23105-fast16-papers-schroeder.pdf
>>>>
>>>> Hope this helps.
>>>>
>>>> Matt Kennedy
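(A rough sketch of the smartctl-based measurement described above: read the drive's lifetime-writes SMART attribute and divide by the bytes logically written to Cassandra. The Total_LBAs_Written attribute name, the 512-byte sector size, and the device path are drive-dependent assumptions, and smartctl usually needs root:)

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class WriteAmpSketch
    {
        public static void main(String[] args) throws Exception
        {
            // Bytes logically written to the database, supplied by hand.
            long logicalBytes = Long.parseLong(args[0]);

            // Ask SMART for the drive's lifetime write counter.
            Process p = new ProcessBuilder("smartctl", "-A", "/dev/sda").start();
            long physicalBytes = 0;
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream())))
            {
                String line;
                while ((line = r.readLine()) != null)
                {
                    // Many SATA SSDs expose Total_LBAs_Written; assume 512-byte LBAs.
                    if (line.contains("Total_LBAs_Written"))
                    {
                        String[] cols = line.trim().split("\\s+");
                        physicalBytes = Long.parseLong(cols[cols.length - 1]) * 512L;
                    }
                }
            }
            System.out.printf("write amplification ~= %.1fx%n",
                              (double) physicalBytes / logicalBytes);
        }
    }
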
>>>> On Thu, Mar 10, 2016 at 7:05 AM, Paulo Motta <pauloricard...@gmail.com> wrote:
>>>>
>>>>> This is a good source on Cassandra + write amplification: http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives
>>>>>
>>>>> 2016-03-10 9:57 GMT-03:00 Benjamin Lerer <benjamin.le...@datastax.com>:
>>>>>
>>>>>> Cassandra should not cause any write amplification. Write amplification happens only when you update data on SSDs. Cassandra does not update any data in place. Data can be rewritten during compaction, but it is never updated.
>>>>>>
>>>>>> Benjamin
>>>>>>
>>>>>> On Thu, Mar 10, 2016 at 12:42 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Dikang,
>>>>>>>
>>>>>>> I am not sure what you mean by "amplification", but as sizes depend highly on the data structure, I would probably give it a try using CCM (https://github.com/pcmanus/ccm) or some test cluster with 'production like' settings and schema. You can write a row, flush it, and see how big the data is cluster-wide / per node.
>>>>>>>
>>>>>>> Hope this will be of some help.
>>>>>>>
>>>>>>> C*heers,
>>>>>>> -----------------------
>>>>>>> Alain Rodriguez - al...@thelastpickle.com
>>>>>>> France
>>>>>>>
>>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>>> http://www.thelastpickle.com
>>>>>>>
>>>>>>> 2016-03-10 7:18 GMT+01:00 Dikang Gu <dikan...@gmail.com>:
>>>>>>>
>>>>>>>> Hello there,
>>>>>>>>
>>>>>>>> I'm wondering, is there a good way to measure the write amplification of Cassandra?
>>>>>>>>
>>>>>>>> I'm thinking it could be calculated as (number of bytes written to the disk) / (size of mutations written to the node).
>>>>>>>>
>>>>>>>> Do we already have a metric for "size of mutations written to the node"? I did not find one in the jmx metrics.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dikang
>>>
>>> --
>>> Dikang

--
Dikang