Hi Jan,

Our LevelDB developer, Matthew, sent along a reply.  Having read it and
your last reply, I am at the limits of what I can suggest, other than
to note that if you're I/O bound, running the disks in RAID 0 rather
than RAID 1 may help.

Please contact me off list if you have any issues getting those files
to Matthew because of their size.

On Wed, Oct 17, 2012 at 5:59 AM,  <[email protected]> wrote:
> Hi Evan,
>
> I corrected the setup according to your recommendations:
>
> - vm.swappiness is 0
> - fs is ext4 on software RAID1, mounted with noatime
> - disk scheduler is set to deadline (it was the default)
> - eleveldb max_open_files is set to 200, cache is set to default
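[For reference, the settings in the list above can be double-checked from a shell. A sketch only: the device and filesystem names are whatever the actual box uses, and the `|| echo` fallback just keeps the last command from failing on machines without an ext4 mount.]

```shell
# swappiness should read 0 after tuning
cat /proc/sys/vm/swappiness

# the bracketed entry is the active scheduler, e.g. "noop [deadline] cfq"
for q in /sys/block/*/queue/scheduler; do
    echo "$q: $(cat "$q")"
done

# confirm the noatime option on the data filesystem's mount line
grep ' ext4 ' /proc/mounts || echo "no ext4 mounts found"
```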
>
> (BTW, why is Riak not using the new O_NOATIME open(2) flag?)
>
> I restarted the last test with 3x40G and 1x14G DB, and it was able to sustain 
> 1000 ops/sec for 5 minutes. Then node5 stalled with the call stack described 
> in the original mail, with 1 of 4 cores almost 100% busy. The node was writing 
> 29 MB/s (140 IOPS), with an occasional read (<5 IOPS), and had 252 open LevelDB 
> files. The disk has 869 GB of free space.
>
> When I looked at the performance graphs 17 hours later, it was still writing at 
> about 29 MB/s (120 IOPS), with the same call stack. The Riak node was busy even 
> after 17 hours without any application requests, and it was not even 
> connected to the rest of the Riak cluster (the node was not listed by 
> erlang:nodes() on the other nodes). I would suspect a bug in LevelDB, but people 
> are using it in production, aren't they?
>
> I intend to retry the test without the software RAID. Any other hints?
>
> Best regards, Jan
>
> ---------- Original message ----------
> From: Evan Vigil-McClanahan
> Date: Oct 12, 2012
> Předmět: Re: Re: Riak performance problems when LevelDB database grows beyond 
> 16GB
> Hi there, Jan,
>
> The lsof issue is that max_open_files is applied per backend instance
> (one per vnode), IIRC, so if you're maxed out you'll see vnode count *
> max_open_files.
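[That multiplication makes the earlier lsof count less surprising. A quick illustration with hypothetical numbers, since the thread does not state the actual ring size or node count:]

```shell
# Hypothetical: a 64-partition ring (the Riak default) spread over 4
# nodes puts 16 vnodes on each node; actual counts will differ.
vnodes_per_node=16
max_open_files=132   # the per-vnode limit from Jan's app.config
echo $((vnodes_per_node * max_open_files))   # prints 2112
```

[On those assumptions the effective per-node ceiling is 2112 descriptors, so the 267 open LevelDB files lsof reported is well within it.]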
>
> I think on the second try you may have set the cache too high. I'd
> drop it back to 8 or 16 MB, and possibly raise max_open_files a bit
> more, though you don't seem to be running into contention at this
> point. The cache has a RAM cost, so maybe just leave it where it is
> for now, unless you have quite a lot of memory.
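[In app.config terms, that suggestion would look roughly like the following. Illustrative values only, not a tested configuration: 8 MB is 8388608 bytes, and cache_size, like max_open_files, applies per vnode.]

```erlang
{eleveldb, [
            {data_root, "/home/riak/leveldb"},
            {max_open_files, 150},    %% nudged up a bit, as suggested
            {cache_size, 8388608}     %% 8 MB per vnode
           ]},
```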
>
> Another thing to check is that vm.swappiness is set to 0 and that your
> disk scheduler is set to deadline for spinning disks and noop for
> SSDs.
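[Applied from a root shell, those two settings look roughly like this. `sda` is a placeholder for the actual data disk, and neither change survives a reboot unless persisted, e.g. via /etc/sysctl.conf:]

```shell
# keep the kernel from swapping out the Riak heap
sysctl -w vm.swappiness=0

# pick the deadline elevator for a spinning disk (noop for SSDs)
echo deadline > /sys/block/sda/queue/scheduler
```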
>
> On Fri, Oct 12, 2012 at 5:02 AM,   wrote:
>>> Can you attach the eleveldb portion of your app.config file?
>>> Configuration problems, especially max_open_files being too low, can
>>> often cause issues like this.
>>>
>>> If it isn't sensitive, the whole app.config and vm.args files are also
>>> often helpful.
>>
>> Hello Evan,
>>
>> thanks for responding.
>>
>> I originally had default LevelDB settings. When the node stalled, I changed
>> it to
>>
>>  {eleveldb, [
>>              {data_root, "/home/riak/leveldb"},
>>              {max_open_files, 132},
>>              {cache_size, 377487360}
>>             ]},
>>
>> on all nodes and restarted them all. The application started at about
>> 1000 requests/second, dropped to <500 requests/second after about a
>> minute, and the node stalled again after 41 minutes. BTW, according to
>> lsof(1) it had 267 open LevelDB files, which is more than the 132-file
>> limit (??).

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com