Matthew,

Thanks for the help and suggestions; we really appreciate it. We’re planning on giving Riak 2.0 a shot as soon as it’s released, and are looking forward to the new features.
Best,
Martin

On Jan 10, 2014, at 7:51 AM, Matthew Von-Maszewski <[email protected]> wrote:

> Martin,
>
> Assuming your business continues to grow, this problem will come back under 1.4 … but not for a while. We can push the cache_size as far down as 8 Mbytes to make room for a little more file cache space if needed.
>
> The manual tunings I gave you and the subsequent block_size tuning I mentioned are all automated in the leveldb for Riak 2.0. You should consider that upgrade once it’s available (we are code complete and testing now).
>
> The cache sizing considerations are discussed here:
>
> https://github.com/basho/leveldb/wiki/mv-flexcache
>
> The block size considerations are discussed here:
>
> https://github.com/basho/leveldb/wiki/mv-dynamic-block-size
>
> And sooner or later you are going to be asking why deletes do not free up space (which implies freeing up file cache space):
>
> https://github.com/basho/leveldb/wiki/mv-aggressive-delete
>
> Let me know if you have further questions or concerns.
>
> Matthew
>
>
> On Jan 10, 2014, at 9:41 AM, Martin May <[email protected]> wrote:
>
>> Hi Matthew,
>>
>> We applied this change to node 4, started it up, and it seems much happier (no crazy CPU). We’re going to keep an eye on it for a little while, and then apply this setting to all the other nodes as well.
>>
>> Is there anything we can do to prevent this scenario in the future, or should the settings you suggested take care of that?
>>
>> Thanks,
>> Martin
>>
>> On Jan 10, 2014, at 6:42 AM, Matthew Von-Maszewski <[email protected]> wrote:
>>
>>> Sean,
>>>
>>> I did some math based upon the app.config and LOG files. I am guessing that you are starting to thrash your file cache.
>>>
>>> This theory should be easy to prove / disprove. On that one node, change the cache_size and max_open_files to:
>>>
>>> cache_size 68435456
>>> max_open_files 425
>>>
>>> If I am correct, the node should come up and not cause problems. We are trading block cache space for file cache space. A miss in the file cache is far more costly than a miss in the block cache.
>>>
>>> Let me know how this works for you. It is possible that we might want to talk about raising your block size slightly to reduce file cache overhead.
>>>
>>> Matthew
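
A minimal sketch of where the values Matthew suggests would sit in the eleveldb section of each node's app.config. The data_root path and the commented sst_block_size line are illustrative assumptions, not from this thread; only cache_size and max_open_files come from Matthew's email.

    %% app.config, eleveldb section -- sketch only; assumptions noted above
    {eleveldb, [
        {data_root, "/var/lib/riak/leveldb"},  %% assumed path; keep whatever the node already uses
        {cache_size, 68435456},                %% block cache in bytes, value from Matthew's email
        {max_open_files, 425}                  %% file cache limit, value from Matthew's email
        %% {sst_block_size, 8192}              %% hypothetical example if block size is raised later
    ]},

Restart the node after editing so the new values take effect, as Martin reports doing above.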

>>>
>>> On Jan 9, 2014, at 9:33 PM, Sean McKibben <[email protected]> wrote:
>>>
>>>> We have a 5 node cluster using eLevelDB (1.4.2) and 2i, and this afternoon it started responding extremely slowly. CPU on member 4 was extremely high and we restarted that process, but it didn’t help. We temporarily shut down member 4 and cluster speed returned to normal, but as soon as we boot member 4 back up, the cluster performance goes to shit.
>>>>
>>>> We’ve run into this before, but back then we were able to just start with a fresh set of data after wiping machines, as that was before we migrated to this bare-metal cluster. Now it is causing some pretty significant issues and we’re not sure what we can do to get it back to normal; many of our queues are filling up and we’ll probably have to take node 4 off again just so we can provide a regular quality of service.
>>>>
>>>> We’ve turned off AAE on node 4 but it hasn’t helped. We have some transfers that need to happen but they are going very slowly.
>>>>
>>>> 'riak-admin top' on node 4 reports this:
>>>>
>>>> Load:  cpu       610    Memory:  total      503852    binary    231544
>>>>        procs     804             processes  179850    code       11588
>>>>        runq      134             atom          533    ets         4581
>>>>
>>>> Pid             Name or Initial Func     Time    Reds      Memory     MsgQ  Current Function
>>>> -------------------------------------------------------------------------------------------------
>>>> <6175.29048.3>  proc_lib:init_p/5        '-'     462231    51356760   0     mochijson2:json_bin_is_safe/1
>>>> <6175.12281.6>  proc_lib:init_p/5        '-'     307183    64195856   1     gen_fsm:loop/7
>>>> <6175.1581.5>   proc_lib:init_p/5        '-'     286143    41085600   0     mochijson2:json_bin_is_safe/1
>>>> <6175.6659.0>   proc_lib:init_p/5        '-'     281845    13752      0     sext:decode_binary/3
>>>> <6175.6666.0>   proc_lib:init_p/5        '-'     209113    21648      0     sext:decode_binary/3
>>>> <6175.12219.6>  proc_lib:init_p/5        '-'     168832    16829200   0     riak_client:wait_for_query_results/4
>>>> <6175.8403.0>   proc_lib:init_p/5        '-'     133333    13880      1     eleveldb:iterator_move/2
>>>> <6175.8813.0>   proc_lib:init_p/5        '-'     119548    9000       1     eleveldb:iterator/3
>>>> <6175.8411.0>   proc_lib:init_p/5        '-'     115759    34472      0     riak_kv_vnode:'-result_fun_ack/2-fun-0-'
>>>> <6175.5679.0>   proc_lib:init_p/5        '-'     109577    8952       0     riak_kv_vnode:'-result_fun_ack/2-fun-0-'
>>>> Output server crashed: connection_lost
>>>>
>>>> Based on that, is there anything anyone can think to do to try to bring performance back into the land of usability? Does this appear to be something that may have been resolved in 1.4.6 or 1.4.7?
>>>>
>>>> The only thing we can think of at this point might be to remove or force-remove the member and join in a new, freshly built one, but the last time we attempted that (on a different cluster) our secondary indexes got irreparably damaged and only regained consistency when we copied every individual key to (this) new cluster! Not a good experience :( but I’m hopeful that 1.4.6 may have addressed some of our issues.
>>>>
>>>> Any help is appreciated.
>>>>
>>>> Thank you,
>>>> Sean McKibben

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
