Hi, Jan,

Our leveldb developer, Matthew, sent along a reply. Having read it and your
last reply, I am at the limits of what I can suggest, other than to note that
if you're I/O bound, running the disks in RAID 0 rather than RAID 1 may help.
Please contact me off list if you have any issues getting those files to
Matthew because of their size.

On Wed, Oct 17, 2012 at 5:59 AM, <[email protected]> wrote:
> Hi Evan,
>
> I corrected the setup according to your recommendations:
>
> - vm.swappiness is 0
> - fs is ext4 on software RAID1, mounted with noatime
> - disk scheduler is set to deadline (it was the default)
> - eleveldb max_open_files is set to 200, cache is set to the default
>
> (BTW, why is Riak not using the new O_NOATIME open(2) flag?)
>
> I restarted the last test with 3x40G and 1x14G DBs, and it was able to
> sustain 1000 ops/sec for 5 minutes. Then node5 stalled with the call stack
> described in the original mail, with 1 of 4 cores almost 100% busy. The
> node was writing 29 MB/s (140 IOPS), with an occasional read (<5 IOPS),
> and had 252 open LevelDB files. The disk has 869G of free space.
>
> When I looked at the performance graphs 17 hours later, it was still
> writing about 29 MB/s (120 IOPS), with the same call stack. The Riak node
> was busy even after 17 hours without any application requests, and it was
> not even connected to the rest of the Riak cluster (the node was not
> listed by erlang:nodes() on the other nodes). I would suspect a bug in
> LevelDB, but people are using it in production, aren't they?
>
> I intend to retry the test without the software RAID. Any other hints?
>
> Best regards, Jan
>
> ---------- Original message ----------
> From: Evan Vigil-McClanahan
> Date: 12. 10. 2012
> Subject: Re: Re: Riak performance problems when LevelDB database grows
> beyond 16GB
>
> Hi there, Jan,
>
> The lsof issue is that max_open_files is per backend, IIRC, so if you're
> maxed out you'll see vnode count * max_open_files.
>
> I think on the second try you may have set the cache too high. I'd drop
> it back to 8 or 16 MB, and possibly raise the open files a bit more, but
> you don't seem to be running into contention at this point.
> There's a RAM cost, so maybe just leave it where it is for now, unless
> you have quite a lot of memory.
>
> Another thing to check is that vm.swappiness is set to 0 and that your
> disk scheduler is set to deadline for spinning disks and noop for SSDs.
>
> On Fri, Oct 12, 2012 at 5:02 AM, wrote:
>>> Can you attach the eleveldb portion of your app.config file?
>>> Configuration problems, especially max_open_files being too low, can
>>> often cause issues like this.
>>>
>>> If it isn't sensitive, the whole app.config and vm.args files are also
>>> often helpful.
>>
>> Hello Evan,
>>
>> thanks for responding.
>>
>> I originally had default LevelDB settings. When the node stalled, I
>> changed them to
>>
>> {eleveldb, [
>>     {data_root, "/home/riak/leveldb"},
>>     {max_open_files, 132},
>>     {cache_size, 377487360}
>> ]},
>>
>> on all nodes and restarted them all. The application started to run at
>> about 1000 requests/second; after about 1 minute it dropped below 500
>> requests/second, and the node stalled again after 41 minutes. BTW,
>> according to lsof(1) it had 267 open LevelDB files, which is more than
>> the 132-file limit (??).

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
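[Editor's note] The lsof discrepancy Evan explains above (a per-backend limit, so the node-wide total is roughly vnode count times max_open_files) can be sketched with some shell arithmetic. The 64-partition ring is the Riak default and the 5-node cluster size is an assumption (the thread only mentions a "node5"); the 132-file limit is from the thread.

```shell
# max_open_files is enforced per vnode backend, not per node, so lsof
# can legitimately report more open files than the configured limit.
# Assumptions: default 64-partition ring, 5-node cluster.
echo $((64 / 5))          # ~12 vnodes per node
echo $((64 / 5 * 132))    # ~1584 descriptors possible node-wide,
                          # comfortably above the 267 files lsof reported
```

So seeing 267 open LevelDB files against a 132-file setting is expected behavior, not a leak.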
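[Editor's note] The OS tuning checklist discussed in the thread (swappiness, I/O scheduler, atime) can be applied on Linux roughly as follows. This is a sketch, to be run as root; the device name sda and the mount point /var/lib/riak are placeholders, not taken from the thread.

```shell
# Disable the kernel's preference for swapping out application memory.
sysctl -w vm.swappiness=0

# Use the deadline I/O scheduler for spinning disks (noop for SSDs).
echo deadline > /sys/block/sda/queue/scheduler

# Remount the data filesystem without access-time updates.
mount -o remount,noatime /var/lib/riak
```

To make the first two settings survive a reboot, they would also need to go into /etc/sysctl.conf and the boot-time kernel parameters (or a udev rule), respectively.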
