Re: avg write io wait time regression in 1.2.1

Dietrich Featherston Tue, 06 Nov 2012 16:04:28 -0800

Thanks for the feedback. We haven't noticed any drop in riak
responsiveness--quite the opposite. We were just alarmed at some of the
iostat information we were seeing which may very well result from, as you
pointed out, greater concurrency in layers above the disk subsystem. It's
not of concern at the moment.


As for the cluster itself--we are running 9 physical notes with a ring size
of 64. Each with a 2.93Ghz 8-core Xeon and ~2 TB of RAID0 SSD storage
across 6 physical drives.

Thanks again for looking into this.

D


On Tue, Nov 6, 2012 at 3:19 PM, Matthew Von-Maszewski <[email protected]>wrote:

> Dietrich,
>
> I finally reviewed your LOG.all today.  The basic analysis is:
>
> - you have a really fast disk subsystem, and
> - your machine is bored.
>
> I make the first comment based upon the fact that your Level-0 file
> creations take less than 200 ms on files of 40Mbyte with 10,000 keys (or
> more).  I would like to know more specifics about your hardware and
> operating system (and disk parameters).  That information is likely worthy
> of sharing with all.
>
> I make the second comment after running a shallow analysis of all
> compaction activity per minute in LOG.all.  Your current ingest rate (rate
> of adding data to leveldb) is not very high compared to other sites I have
> seen.  So we should continue this discussion if you still believe there is
> an overall slow down.
>
> I will note that the authors of Erlang have given us some suggested
> changes in our Erlang to leveldb interface this past weekend.  That
> information could also give leveldb a throughput boost if proven valid.
>  Keep you posted.
>
> But at this time I see nothing that yells "massive slow down".  I am of
> course open to being wrong.
>
> Matthew
>
>
> On Nov 2, 2012, at 10:32 AM, Dietrich Featherston wrote:
>
> Here's the level output from one of our upgraded nodes. Included our
> app.config as well. Will continue looking for clues. I can put together
> some snapshots of our sst file sizes across the cluster if you think that
> would help as well.
>
> On Fri, Nov 2, 2012 at 6:19 AM, Matthew Von-Maszewski 
> <[email protected]>wrote:
>
>> Dietrich,
>>
>> I can make two guesses into the increased disk writes.  But I am also
>> willing to review your actual LOG files to isolate root cause.  If you
>> could run the following and post the resulting file from one server, I will
>> review it over the weekend or early next week:
>>
>> sort /var/lib/riak/leveldb/*/LOG >LOG.all
>>
>> The file will compress well.  And no need to stop the server, just gather
>> the LOG data live.
>>
>> Guess 1:  your data is in a transition phase.  1.1 used 2 Megabyte files
>> exclusively.  1.2 is resizing the files to much larger sizes during a
>> compaction.  You could be seeing a larger number of files than usual
>> participating in each compaction as the file sizes change.  While this is
>> possible, I have doubts … hence this is a guess.
>>
>> Guess 2:  I increased the various leveldb file sizes to reduce the number
>> of open and closes, both for writes and random reads.  This helped
>> latencies in both the compactions and random reads.  Any compaction in 1.2
>> is likely to reread and write larger total number of bytes.  While this is
>> possible, I again have doubts … the number of read operations should also
>> go up if this guess is correct.  Your read operations have not increased.
>>  This guess might still be valid if the read operations were satisfied by
>> the Linux memory data cache.  I do not how those would be counted or not
>> counted.
>>
>>
>> Matthew
>>
>>
>> On Nov 1, 2012, at 10:01 PM, Dietrich Featherston wrote:
>>
>> Will check on that.
>>
>> Can you think of anything that would explain the 5x increase in disk
>> writes we are seeing with the same workload?
>>
>>
>> On Thu, Nov 1, 2012 at 6:03 PM, Matthew Von-Maszewski <[email protected]
>> > wrote:
>>
>>> Look for any activity in the LOG.  Level-0 "creations" are fast and not
>>> typically relevant.  You would be most interested in LOG lines containing
>>> "Compacting" (start) and "Compacted" (end).  The time in between will
>>> throttle.  The idea is that these compaction events can pile up, one after
>>> another and multiple overlapping.  It is these heavy times where the
>>> throttle saves the user experience.
>>>
>>> Matthew
>>>
>>>
>>> On Nov 1, 2012, at 8:54 PM, Dietrich Featherston wrote:
>>>
>>> Thanks. The amortized stalls may very well describe what we are seeing.
>>> If I combine leveldb logs from all partitions on one of the upgraded nodes
>>> what should I look for in terms of compaction activity to verify this?
>>>
>>>
>>> On Thu, Nov 1, 2012 at 5:48 PM, Matthew Von-Maszewski <
>>> [email protected]> wrote:
>>>
>>>> Dietrich,
>>>>
>>>> I can see your concern with the write IOS statistic.  Let me comment on
>>>> the easy question first:  block_size.
>>>>
>>>> The block_size parameter in 1.1 was not getting passed to leveldb from
>>>> the erlang layer.  You were using a 4096 byte block parameter no matter
>>>> what you typed in the app.config.  The block_size is used by leveldb as a
>>>> threshold.  Once you accumulate enough data above that threshold, the
>>>> current block is written to disk and a new one started.  If you have 10k
>>>> data values, your get one data item per block and its size is ~10k.  If you
>>>> have 1k data values, you get about four per block and the block is about 
>>>> 4k.
>>>>
>>>> We recommend 4k blocks to help read performance.  The entire block has
>>>> to run through decompression and potentially CRC calculation when it comes
>>>> off the disk.  That CPU time really kills any disk performance gains by
>>>> having larger blocks.  Ok, that might change in 1.3 as we enable hardware
>>>> CRC … but only if you have "verify_checksums, true" in app.config.
>>>>
>>>>
>>>> Back to performance:  I have not seen the change your graph details
>>>> when testing with SAS drives under moderate load.  I am only today starting
>>>> qualification tests with SSD drives.
>>>>
>>>> But my 1.2 and 1.3 tests focus on drive / Riak saturation.  1.1 has the
>>>> nasty tendency to stall (intentionally) when we saturate the write side of
>>>> leveldb, .  The stall was measured in seconds or even minutes in 1.1.
>>>>  1.2.1 has a write throttle that forecasts leveldb's stall state and
>>>> incrementally slows the individual writes to prevent the stalls.  Maybe
>>>> that is what is being seen in the graph.  The only way to know for sure is
>>>> to get an dump of your leveldb LOG files, combined them, then compare
>>>> compaction activity to your graph.
>>>>
>>>> Write stalls are detailed here:
>>>> http://basho.com/blog/technical/2012/10/30/leveldb-in-riak-1p2/
>>>>
>>>> How can I better assist you at this point?
>>>>
>>>> Matthew
>>>>
>>>>
>>>> On Nov 1, 2012, at 8:13 PM, Dietrich Featherston wrote:
>>>>
>>>> We've just gone through the process of upgrading two riak clusters from
>>>> 1.1  to 1.2.1. Both are on the leveldb backend backed by RAID0'd SSDs. The
>>>> process has gone smoothly and we see that latencies as measured at the
>>>> gen_fsm level are largely unaffected.
>>>>
>>>> However, we are seeing some troubling disk statistics and I'm looking
>>>> for an explanation before we upgrade the remainder of our nodes. The source
>>>> of the worry seems to be a huge amplification in the number of writes
>>>> serviced by the disk which may be the cause of rising io wait times.
>>>>
>>>> My first thought was that this could be due to some leveldb tuning in
>>>> 1.2.1 which increases file sizes per the release notes (
>>>> https://github.com/basho/riak/blob/master/RELEASE-NOTES.md). But nodes
>>>> that were upgraded yesterday are still showing this symptom. I would have
>>>> expected any block re-writing to have subsided by now.
>>>>
>>>> Next hypothesis has to do with block size overriding in app.config. In
>>>> 1.1, we had specified custom block sizes of 256k. We removed this prior to
>>>> upgrading to 1.2.1 at the advice of #riak since block size configuration
>>>> was ignored prior to 1.2 ('"block_size" parameter within app.config for
>>>> leveldb was ignored.  This parameter is now properly passed to leveldb.'
>>>> -->
>>>> https://github.com/basho/riak/commit/f12596c221a9d942cc23d8e4fd83c9ca46e02105).
>>>> I'm wondering if the block size parameter really was being passed to
>>>> leveldb, and having removed it, blocks are now being rewritten to a new
>>>> size, perhaps different from what they were being written as before (
>>>> https://github.com/basho/riak_kv/commit/ad192ee775b2f5a68430d230c0999a2caabd1155
>>>> )
>>>>
>>>> Here is the output of the following script showing the increased writes
>>>> to disk (https://gist.github.com/37319a8ed2679bb8b21d)
>>>>
>>>>  --an upgraded 1.2.1 node--
>>>> read ios: 238406742
>>>> write ios: 4814320281
>>>> read/write ratio: .04952033
>>>> avg wait: .10712340
>>>> read wait: .49174364
>>>> write wait: .42695475
>>>>
>>>>
>>>> --a node still running 1.1--
>>>> read ios: 267770032
>>>> write ios: 944170656
>>>> read/write ratio: .28360342
>>>> avg wait: .34237204
>>>> read wait: .47222371
>>>> write wait: 1.83283749
>>>>
>>>> And here's what munin is showing us in terms of avg io wait times.
>>>>
>>>> <image.png>
>>>>
>>>>
>>>> Any thoughts on what might explain this?
>>>>
>>>> Thanks,
>>>> D
>>>>
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> [email protected]
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> [email protected]
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>
>>>>
>>>
>>>
>>
>>
> <disk-latency-riak-1.2.1.tar.gz>
>
>
>

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: avg write io wait time regression in 1.2.1

Reply via email to