John and Shane,

I have been looking into some memory issues lately and I would be very interested in more information about your particular problems. If either of you is able to get some output from etop using the -sort memory option while memory usage is elevated, it would be very helpful to see. I know that you sometimes get the connection_lost message when trying to use etop, but I have found that if you keep trying it may succeed after a few attempts.
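In case it saves some typing, here is a minimal sketch of the keep-trying approach in Python. It rests on an assumption that I have not verified on every version: a failed attempt (the connection_lost case) exits within a few seconds, while a successful etop session keeps running. The helper name `run_until_it_stays_up` and its parameters are made up for illustration.

```python
import subprocess
import time


def run_until_it_stays_up(cmd, attempts=5, grace=3.0, delay=2.0):
    """Run cmd, retrying whenever it exits within `grace` seconds.

    Returns the 1-based number of the attempt that stayed up, or None
    if every attempt exited early.
    """
    for attempt in range(1, attempts + 1):
        # Output is not captured, so it goes straight to the terminal.
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            # Still running after the grace period: assume (hypothetically)
            # that the tool connected, and block until it is stopped.
            proc.wait()
            return attempt
        print(f"attempt {attempt} exited early with code {proc.returncode}")
        if attempt < attempts:
            time.sleep(delay)
    return None
```

Something like `run_until_it_stays_up(["riak-admin", "top", "-sort", "memory"])` would be the intended use, assuming "riak-admin top" passes the -sort option through to etop on your version. Any exit inside the grace period counts as a failure here, even a clean one, so the grace period may need adjusting.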
Is either of you using MapReduce? I see that John is using 2I. Shane, do you also use 2I? Finally, do you notice a lot of messages on the console or in the console log that contain either the phrase 'monitor large_heap' or 'monitor long_gc'?

Kelly

On Oct 2, 2012, at 6:11 AM, "John E. Vincent" <lusis.org+riak-us...@gmail.com> wrote:

> I would highly suggest you upgrade to 1.2 when possible. We were, up
> until recently, running on 1.1.4 and seeing the same problems you
> describe. Take a look at this graph:
>
> http://i.imgur.com/0RtsU.png
>
> That's just one of our nodes but all of them exhibited the same
> behavior. The falloffs are where we had to bounce riak.
>
> This is what one of our nodes looks like now and has looked like since
> the upgrade:
>
> http://i.imgur.com/pm7Nk.png
>
> The change was SO dramatic that I seriously thought /stats was broken.
> I've verified outside of Riak and inside. The memory usage change was
> very positive. Evidently there's even still a memory leak.
>
> We're heavy 2i users. No multi backend.
>
> On Tue, Oct 2, 2012 at 4:08 AM, Shane McEwan <sh...@mcewan.id.au> wrote:
>> G'day!
>>
>> Just recently we've noticed memory usage in our Riak cluster constantly
>> increasing.
>>
>> The memory usage reported by the Riak stats "memory_total" parameter has
>> been less than 100MB for nearly a year but has recently increased to over
>> 1GB.
>>
>> If we restart the cluster memory usage usually returns to what we would
>> call "normal" but after a week or so of stability the memory usage starts
>> gradually growing again. Sometimes after a growth spurt over a few days the
>> memory usage will plateau and be stable again for a week or two and then put
>> on another growth spurt. The memory usage starts increasing at the same
>> moment on all 4 nodes.
>>
>> This graph [http://imagebin.org/230614] shows what I mean. The green line
>> shows the memory usage as reported by "memory_total" (left-hand y-axis scale).
>> The red line shows the memory used by Riak's beam.smp process (right-hand
>> y-axis scale).
>>
>> Also notice that the gradient of the recent growth seems to be increasing
>> compared to the memory increases we had in August.
>>
>> We might have just assumed that the memory usage was normal Riak behaviour.
>> Perhaps we have just tipped over some sort of internal buffer or cache and
>> that causes some more memory to be allocated. However, whenever we notice
>> the memory usage increasing it always coincides with the "riak-admin top"
>> command failing to run.
>>
>> We try to run "riak-admin top" to diagnose what is using the memory but it
>> returns: "Output server crashed: connection_lost". If we restart the cluster
>> the top command works fine (but, of course, there's nothing interesting to
>> see after a restart!).
>>
>> So our theory at the moment is that some sort of instability or race
>> condition is causing Riak to start consuming more and more memory. A side
>> effect of this instability is that the internal processes needed for running
>> the top command are not working correctly. The actual functionality of Riak
>> doesn't seem to be affected. Our application is running fine. We see a
>> slight increase in "FSM Put" times and CPU usage during the memory growth
>> phases but all other parameters we're monitoring on the system seem
>> unaffected.
>>
>> There's nothing abnormal in the logs. We get a lot of "riak_pipe_builder_sup
>> {sink_died,normal}" messages but they can be ignored, apparently. The
>> cluster is under constant load so we would expect to see either gradual
>> memory increase or a steady state but not both. Erlang process count, open
>> file handles, etc are stable.
>>
>> So I was wondering if anyone has seen similar behaviour before?
>> Is there anything else we can do to diagnose the problem?
>> I'm accessing the stats URL once per minute, could that have any side
>> effects?
>> We'll be upgrading to Riak 1.2 and new hardware in the next few weeks so
>> should we just ignore it and hope it goes away?
>> Any other ideas?
>> Or is this just normal?
>>
>> Riak config:
>> 4 VMware nodes
>> ring_creation_size, 256
>> n_val, 3
>> eleveldb backend:
>>   max_open_files, 20
>>   cache_size, 15728640
>> "riak_kv_version":"1.1.1",
>> "riak_core_version":"1.1.1",
>> "stdlib_version":"1.17.4",
>> "kernel_version":"2.14.4"
>> Erlang R14B03 (erts-5.8.4)
>>
>> Thanks!
>>
>> Shane.
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users@lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com