As a follow-up, I'm going to write a simple patch to expose the number of bytes flushed from memtables to JMX, so that we can easily monitor it.
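Roughly, the idea is to hang a counter off the Codahale-based metrics plumbing that Cassandra already publishes over JMX. A minimal standalone sketch of the shape of it (the class and metric names here are hypothetical, not the actual patch; the real change would register the counter in Cassandra's existing per-table metric hierarchy):

    import com.codahale.metrics.Counter;
    import com.codahale.metrics.JmxReporter;
    import com.codahale.metrics.MetricRegistry;

    public class MemtableFlushMetricSketch
    {
        private static final MetricRegistry registry = new MetricRegistry();

        // Hypothetical metric name; the real patch would hang this off the
        // per-table metrics instead of a standalone registry.
        private static final Counter bytesFlushed =
                registry.counter(MetricRegistry.name("Memtable", "BytesFlushed"));

        public static void main(String[] args) throws InterruptedException
        {
            // Publish the registry over JMX (metrics-core 3.x API) so that
            // jconsole or any other JMX client can read it.
            JmxReporter.forRegistry(registry)
                       .inDomain("org.apache.cassandra.metrics")
                       .build()
                       .start();

            // Wherever a memtable flush completes, record the flushed size:
            bytesFlushed.inc(42L * 1024 * 1024); // e.g. a 42MB flush
            Thread.sleep(60_000); // keep the JVM up long enough to inspect
        }
    }
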
Here is the jira: https://issues.apache.org/jira/browse/CASSANDRA-11420

On Thu, Mar 10, 2016 at 12:55 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> The doc does say this:
>
> "A log-structured engine that avoids overwrites and uses sequential IO to update data is essential for writing to solid-state disks (SSD) and hard disks (HDD). On HDD, writing randomly involves a higher number of seek operations than sequential writing. The seek penalty incurred can be substantial. Using sequential IO (thereby avoiding write amplification <http://en.wikipedia.org/wiki/Write_amplification> and disk failure), Cassandra accommodates inexpensive, consumer SSDs extremely well."
>
> I presume that write amplification argues for placing the commit log on a separate SSD device. That should probably be mentioned.
>
> -- Jack Krupansky
>
> On Thu, Mar 10, 2016 at 12:52 PM, Matt Kennedy <matt.kenn...@datastax.com> wrote:
>
>> It isn't really the data written by the host that you're concerned with, it's the data written by your application. I'd start by instrumenting your application tier to tally up the size of the values that it writes to C*.
>>
>> However, this value may not be extremely useful on its own; you can't do much with the information it provides. It is probably a better idea to track the bytes written to flash for each drive, so that you know the physical endurance of that type of drive given your workload. Unfortunately, the rated TBW endurance for the drive may not be very meaningful either, given the difference between the synthetic workload used to create those ratings and the workload that Cassandra produces in your particular case. You can find out more about those ratings here: https://www.jedec.org/standards-documents/docs/jesd219a
>>
>> Matt Kennedy
>>
>> Sr. Product Manager, DSE Core
>>
>> matt.kenn...@datastax.com | Public Calendar <http://goo.gl/4Ui04Z>
>>
>> *DataStax Enterprise - the database for cloud applications.*
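(To make the application-tier tallying above concrete: one low-tech way is a process-wide counter that the application bumps with the serialized size of every value just before writing it. A sketch, where writeToCassandra stands in for whatever client call the application actually makes:)

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.atomic.AtomicLong;

    public class WriteTally
    {
        // Process-wide tally of the bytes this application has handed to C*.
        static final AtomicLong bytesWritten = new AtomicLong();

        static void tallyingWrite(String key, String value)
        {
            byte[] k = key.getBytes(StandardCharsets.UTF_8);
            byte[] v = value.getBytes(StandardCharsets.UTF_8);
            bytesWritten.addAndGet(k.length + v.length);
            // writeToCassandra(k, v); // hypothetical: the app's real write path
        }

        public static void main(String[] args)
        {
            tallyingWrite("user:42", "some payload");
            System.out.println("logical bytes written: " + bytesWritten.get());
        }
    }
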
>> On Thu, Mar 10, 2016 at 11:44 AM, Dikang Gu <dikan...@gmail.com> wrote:
>>
>>> Hi Matt,
>>>
>>> Thanks for the detailed explanation! Yes, this is exactly what I'm looking for: "write amplification = data written to flash / data written by the host".
>>>
>>> We are heavily using LCS in production, so I'd like to figure out the amplification caused by that and see what we can do to optimize it. I have the metrics for "data written to flash", and I'm wondering whether there is an easy way to get the "data written by the host" on each C* node?
>>>
>>> Thanks
>>>
>>> On Thu, Mar 10, 2016 at 8:48 AM, Matt Kennedy <mkenn...@datastax.com> wrote:
>>>
>>>> TL;DR - Cassandra actually causes a ton of write amplification, but it doesn't freaking matter any more. Read on for details...
>>>>
>>>> That slide deck does have a lot of very good information on it, but unfortunately I think it has led to a fundamental misunderstanding about Cassandra and write amplification. In particular, slide 51 vastly oversimplifies the situation.
>>>>
>>>> The wikipedia definition of write amplification looks at this from the perspective of the SSD controller: https://en.wikipedia.org/wiki/Write_amplification#Calculating_the_value
>>>>
>>>> In short, write amplification = data written to flash / data written by the host.
>>>>
>>>> So, if I write 1MB in my application, but the SSD has to write my 1MB plus rearrange another 1MB of data in order to make room for it, then I've written a total of 2MB and my write amplification is 2x.
>>>>
>>>> In other words, it is measuring how much extra the SSD controller has to write in order to do its own housekeeping.
>>>>
>>>> However, the wikipedia definition is a bit more constrained than how the term is used in the storage industry. The whole point of looking at write amplification is to understand the impact that a particular workload is going to have on the underlying NAND by virtue of the data written. So a definition of write amplification that is a little more relevant to the context of Cassandra is this:
>>>>
>>>> write amplification = data written to flash / data written to the database
>>>>
>>>> So, while the fact that we only sequentially write large immutable SSTables does in fact mean that controller-level write amplification is near zero, compaction comes along and completely destroys that tidy little story. Think about it: every time a compaction re-writes data that has already been written, we are creating a lot of application-level write amplification. Different compaction strategies and the workload itself affect what the real application-level write amp is, but generally speaking, LCS is the worst, followed by STCS, and DTCS will cause the least write amp. To measure this, you can usually use smartctl (there may be another mechanism depending on the SSD manufacturer) to get the physical bytes written to your SSDs, and divide that by the data that you've actually logically written to Cassandra. I've measured (more than two years ago) LCS write amp as high as 50x on some workloads, which is significantly higher than the typical controller-level write amp on a b-tree style update-in-place data store. Also note that the new storage engine generally removes a lot of storage inefficiency, thereby reducing the impact of write amp due to compaction.
>>>>
>>>> However, if you're a person that understands SSDs, at this point you're wondering why we aren't burning out SSDs right and left. The reality is that general SSD endurance has gotten so good that all this write amp isn't really a problem any more. If you're curious to read more about that, I recommend you start here:
>>>>
>>>> http://hothardware.com/news/google-data-center-ssd-research-report-offers-surprising-results-slc-not-more-reliable-than-mlc-flash
>>>>
>>>> and the paper that article mentions:
>>>>
>>>> http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/23105-fast16-papers-schroeder.pdf
>>>>
>>>> Hope this helps.
>>>>
>>>> Matt Kennedy
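(A rough sketch of the smartctl-based measurement described above: read the drive's lifetime-writes SMART attribute and divide by the bytes logically written to Cassandra. The Total_LBAs_Written attribute name, the 512-byte sector size, and the device path are drive-dependent assumptions, and smartctl usually needs root:)

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class WriteAmpSketch
    {
        public static void main(String[] args) throws Exception
        {
            // Bytes logically written to the database, supplied by hand.
            long logicalBytes = Long.parseLong(args[0]);

            // Ask SMART for the drive's lifetime write counter.
            Process p = new ProcessBuilder("smartctl", "-A", "/dev/sda").start();
            long physicalBytes = 0;
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream())))
            {
                String line;
                while ((line = r.readLine()) != null)
                {
                    // Many SATA SSDs expose Total_LBAs_Written; assume 512-byte LBAs.
                    if (line.contains("Total_LBAs_Written"))
                    {
                        String[] cols = line.trim().split("\\s+");
                        physicalBytes = Long.parseLong(cols[cols.length - 1]) * 512L;
                    }
                }
            }
            System.out.printf("write amplification ~= %.1fx%n",
                              (double) physicalBytes / logicalBytes);
        }
    }
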
>>>> On Thu, Mar 10, 2016 at 7:05 AM, Paulo Motta <pauloricard...@gmail.com> wrote:
>>>>
>>>>> This is a good source on Cassandra + write amplification: http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives
>>>>>
>>>>> 2016-03-10 9:57 GMT-03:00 Benjamin Lerer <benjamin.le...@datastax.com>:
>>>>>
>>>>>> Cassandra should not cause any write amplification. Write amplification happens only when you update data on SSDs. Cassandra does not update any data in place. Data can be rewritten during compaction, but it is never updated.
>>>>>>
>>>>>> Benjamin
>>>>>>
>>>>>> On Thu, Mar 10, 2016 at 12:42 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Dikang,
>>>>>>>
>>>>>>> I am not sure what you mean by "amplification", but as sizes depend highly on the data structure, I would probably give it a try using CCM (https://github.com/pcmanus/ccm) or some test cluster with 'production like' settings and schema. You can write a row, flush it, and see how big the data is cluster-wide / per node.
>>>>>>>
>>>>>>> Hope this will be of some help.
>>>>>>>
>>>>>>> C*heers,
>>>>>>> -----------------------
>>>>>>> Alain Rodriguez - al...@thelastpickle.com
>>>>>>> France
>>>>>>>
>>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>>> http://www.thelastpickle.com
>>>>>>>
>>>>>>> 2016-03-10 7:18 GMT+01:00 Dikang Gu <dikan...@gmail.com>:
>>>>>>>
>>>>>>>> Hello there,
>>>>>>>>
>>>>>>>> I'm wondering, is there a good way to measure the write amplification of Cassandra?
>>>>>>>>
>>>>>>>> I'm thinking it could be calculated as (number of bytes written to the disk) / (size of mutations written to the node).
>>>>>>>>
>>>>>>>> Do we already have a metric for "size of mutations written to the node"? I did not find one in the jmx metrics.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> --
>>>>>>>> Dikang
>>>
>>> --
>>> Dikang

--
Dikang