Again, all of these things are signs of large objects, so if you could track the object_size stats on the cluster, I think that we might see something. Even if you have no monitoring, a simple shell script curling /stats/ on each node once a minute should do the job for a day or two.
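Something like this minimal sketch would do it, assuming the default HTTP listener on port 8098 and the node_get_fsm_objsize_* stat names (adjust the host, port, and output path for your nodes):

#!/bin/sh
# Rough sketch: poll the Riak HTTP stats endpoint once a minute and append
# the object-size stats (node_get_fsm_objsize_*) to a log file, prefixed
# with a timestamp. Assumes the default HTTP listener on port 8098.
while true; do
    ts=$(date '+%Y-%m-%dT%H:%M:%S')
    curl -s http://127.0.0.1:8098/stats \
        | tr ',' '\n' \
        | grep objsize \
        | sed "s/^/$ts /" >> /var/tmp/riak_objsize.log
    sleep 60
done

If the 1 GBit traffic spikes you're seeing line up with jumps in those object-size stats, that would point pretty strongly at the large-object theory.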
On Wed, Apr 3, 2013 at 9:29 AM, Ingo Rockel <[email protected]> wrote:
> We just had it again (around this time of day we have our highest user
> activity).
>
> I will set +P to 131072 tomorrow; is there anything else I should check
> or change?
>
> What about this memory-high-watermark message which I get sporadically?
>
> Ingo
>
> On 03.04.2013 17:57, Evan Vigil-McClanahan wrote:
>
>> As for +P: it has been raised in R16 (which is what the current man page
>> documents); on R15 it is only 32k.
>>
>> The behavior that you're describing does sound like a very large object
>> getting put into the cluster (which may cause backups and push you up
>> against the process limit, could have caused scheduler collapse on 1.2,
>> etc.).
>>
>> On Wed, Apr 3, 2013 at 8:39 AM, Ingo Rockel
>> <[email protected]> wrote:
>>>
>>> Evan,
>>>
>>> sys_process_count is somewhere between 5k and 11k on the nodes right now.
>>> Concerning your suggested +P config: according to the Erlang docs, the
>>> default for this param is already 262144, so setting it to 65536 would
>>> in fact lower it?
>>>
>>> We chose the ring size to be able to handle growth, which was the main
>>> reason to switch from MySQL to NoSQL/Riak. We have 12 nodes, so about 86
>>> vnodes per node.
>>>
>>> No, we don't monitor object sizes. The majority of objects are very small
>>> (below 200 bytes), but we have objects storing references to these small
>>> objects which can grow to a few megabytes in size; most of these are
>>> paged, though, and should not exceed one megabyte. Only one type is not
>>> paged (for implementation reasons).
>>>
>>> The outgoing/incoming traffic is constantly around 100 MBit; when the
>>> performance drops happen, we suddenly see spikes of up to 1 GBit. And
>>> these spikes consistently happen on three nodes for as long as the
>>> performance drop lasts.
>>>
>>> Ingo
>>>
>>> On 03.04.2013 17:12, Evan Vigil-McClanahan wrote:
>>>
>>>> Ingo,
>>>>
>>>>     riak-admin status | grep sys_process_count
>>>>
>>>> will tell you how many processes are running. The default process limit
>>>> in Erlang is a little low, and we'd suggest raising it (especially with
>>>> your extra-large ring_size). Erlang processes are cheap, so 65535 or
>>>> even double that will be fine.
>>>>
>>>> Busy dist ports are still worrying. Are you monitoring object sizes?
>>>> Are there any spikes there associated with the performance drops?
>>>>
>>>> On Wed, Apr 3, 2013 at 8:03 AM, Ingo Rockel
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi Evan,
>>>>>
>>>>> I set swt to very_low and zdbbl to 64MB; setting these params helped
>>>>> reduce the busy_dist_port and "Monitor got {suppressed,..." messages a
>>>>> lot. But when the performance of the cluster suddenly drops, we still
>>>>> see these messages.
>>>>>
>>>>> The cluster was updated to 1.3 in the meantime.
>>>>>
>>>>> The eleveldb section:
>>>>>
>>>>> %% eLevelDB Config
>>>>> {eleveldb, [
>>>>>     {data_root, "/var/lib/riak/leveldb"},
>>>>>     {cache_size, 33554432},
>>>>>     {write_buffer_size_min, 67108864},   %% 64 MB in bytes
>>>>>     {write_buffer_size_max, 134217728},  %% 128 MB in bytes
>>>>>     {max_open_files, 4000}
>>>>> ]},
>>>>>
>>>>> The ring size is 1024 and the machines have 48GB of memory. Concerning
>>>>> the params from vm.args:
>>>>>
>>>>> -env ERL_MAX_PORTS 4096
>>>>> -env ERL_MAX_ETS_TABLES 8192
>>>>>
>>>>> +P isn't set.
>>>>>
>>>>> Ingo
>>>>>
>>>>> On 03.04.2013 16:53, Evan Vigil-McClanahan wrote:
>>>>>
>>>>>> As for your prior mail, I thought that someone had answered.
>>>>>> Our initial suggestion was to add +swt very_low to your vm.args, as
>>>>>> well as setting the +zdbbl flag that Jon recommended in the list post
>>>>>> you pointed to. If those help, moving to 1.3 should help more.
>>>>>>
>>>>>> Other limits in vm.args that can cause problems are +P, ERL_MAX_PORTS,
>>>>>> and ERL_MAX_ETS_TABLES. Are any of these set? If so, to what?
>>>>>>
>>>>>> Can you also paste the eleveldb section of your app.config?
>>>>>>
>>>>>> On Wed, Apr 3, 2013 at 7:41 AM, Ingo Rockel
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Evan,
>>>>>>>
>>>>>>> I'm not sure; I find a lot of these:
>>>>>>>
>>>>>>> 2013-03-30 23:27:52.992 [error]
>>>>>>> <0.8036.323>@riak_api_pb_server:handle_info:141 Unrecognized message
>>>>>>> {22243034,{error,timeout}}
>>>>>>>
>>>>>>> and at around the same time some messages of the kind below get
>>>>>>> logged (although the one pasted here has a different timestamp):
>>>>>>>
>>>>>>> 2013-03-30 23:27:53.056 [error]
>>>>>>> <0.9457.323>@riak_kv_console:status:178 Status failed
>>>>>>> error:terminated
>>>>>>>
>>>>>>> Ingo
>>>>>>>
>>>>>>> On 03.04.2013 16:24, Evan Vigil-McClanahan wrote:
>>>>>>>
>>>>>>>> Resending to the list:
>>>>>>>>
>>>>>>>> Ingo,
>>>>>>>>
>>>>>>>> That is an indication that the protocol buffers server can't spawn
>>>>>>>> a put fsm, which means that a put cannot be done for some reason or
>>>>>>>> another. Are there any other messages that appear around this time
>>>>>>>> that might indicate why?
>>>>>>>>
>>>>>>>> On Wed, Apr 3, 2013 at 12:09 AM, Ingo Rockel
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> we have some performance issues with our Riak cluster: from time
>>>>>>>>> to time we have a sudden drop in performance (I already asked the
>>>>>>>>> list about this, but no one had an idea).
>>>>>>>>> Although not at exactly the same time, on the problematic nodes
>>>>>>>>> we see a lot of these messages from time to time:
>>>>>>>>>
>>>>>>>>> 2013-04-02 21:41:11.173 [warning] <0.25646.475> ** Can not start
>>>>>>>>> proc_lib:init_p
>>>>>>>>> ,[<0.14556.474>,[<0.9519.474>,riak_api_pb_sup,riak_api_sup,<0.1291.0>],riak_kv_p
>>>>>>>>> ut_fsm,start_link,[{raw,65032165,<0.9519.474>},{r_object,<<109>>,<<77,115,124,49
>>>>>>>>> ,53,55,57,56,57,56,50,124,49,51,54,52,57,51,49,54,49,49,53,49,50,52,53,54>>,[{r_
>>>>>>>>> content,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>>>>>>>>> {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},<<>>}],[],{dict,2,16,16,8,8
>>>>>>>>> 0,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[]
>>>>>>>>> ,[],[],[[<<99,111,110,116,101,110,116,45,116,121,112,101>>,97,112,112,108,105,99
>>>>>>>>> ,97,116,105,111,110,47,106,115,111,110]],[],[],[],[],[[<<99,104,97,114,115,101,1
>>>>>>>>> 16>>,85,84,70,45,56]]}}},<<123,34,115,116,34,58,50,44,34,116,34,58,49,44,34,99,3
>>>>>>>>> 4,58,34,66,117,116,32,115,104,101,32,105,115,32,103,111,110,101,44,32,110,32,101
>>>>>>>>> ,118,101,110,32,116,104,111,117,103,104,32,105,109,32,110,111,116,32,105,110,32,
>>>>>>>>> 117,114,32,99,105,116,121,32,105,32,108,111,118,101,32,117,32,110,100,32,105,32,
>>>>>>>>> 109,101,97,110,32,105,116,32,58,39,40,34,44,34,114,34,58,49,52,51,52,54,52,51,57
>>>>>>>>> ,44,34,115,34,58,49,53,55,57,56,57,56,50,44,34,99,116,34,58,49,51,54,52,57,51,49
>>>>>>>>> ,54,49,49,53,49,50,44,34,97,110,34,58,102,97,108,115,101,44,34,115,107,34,58,49,
>>>>>>>>> 51,54,52,57,51,49,54,49,49,53,49,50,52,53,54,44,34,115,117,34,58,48,125>>},[{tim
>>>>>>>>> eout,60000}]]] on '[email protected]' **
>>>>>>>>>
>>>>>>>>> Can anyone explain to me what these messages mean and whether I
>>>>>>>>> need to do something about them? Could these messages be in any
>>>>>>>>> way related to the performance issues?
>>>>>>>>>
>>>>>>>>> Ingo

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
