If your n_val is still three, then three sad nodes is a suspicious
number, since each object is written to three replicas.  My first guess
would be a very large value being put in and other requests backing up
behind it.  That would explain the health-check failures (especially if
you're normally doing a lot of small/fast reads and writes).
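
To check whether the same vnode keeps turning up across incidents, a
rough Python sketch like the one below could pull the offending vnodes
and their queue lengths out of each node's log.  It assumes every
health-check entry sits on a single line in the format quoted further
down in this thread; the function names are mine, not anything that
ships with Riak.

```python
import re
from collections import Counter

# Matches the "Disabling riak_kv" health-check entry and captures the
# contents of the Erlang list of {PartitionIndex, QueueLength} tuples.
HEALTH_RE = re.compile(
    r"Disabling riak_kv due to large message queues\.\s+"
    r"Offending vnodes:\s+\[(.*?)\]"
)
PAIR_RE = re.compile(r"\{(\d+),(\d+)\}")


def offending_vnodes(log_text):
    """Return a list of (partition_index, queue_length) pairs, in log order."""
    pairs = []
    for match in HEALTH_RE.finditer(log_text):
        for index, qlen in PAIR_RE.findall(match.group(1)):
            pairs.append((int(index), int(qlen)))
    return pairs


def tally(log_text):
    """Count how often each partition index was reported as offending."""
    return Counter(index for index, _ in offending_vnodes(log_text))
```

If one partition index dominates the tally on all three nodes, that
would fit the single-large-value theory; if the indexes are scattered,
it probably points somewhere else.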

However, even that explanation doesn't get us anywhere near 500000
processes.  It'd be really nice to see that top output.  Maybe leave
it running and spooling to a file to see if you can capture the
output?  What does a frame of it look like now, without the problem
happening?
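
For the spooling part, here is a minimal Python sketch that runs a
command and appends each line of its output to a file with a timestamp.
It is generic: it assumes the command writes plain text to stdout when
it is not attached to a terminal, which I have not verified for
riak-admin top, and the function name and log file name are made up for
the example.

```python
import subprocess
import time


def spool(cmd, path, max_lines=None):
    """Run cmd and append each line of its output to path with a timestamp.

    Stops when the command exits, or after max_lines lines if given.
    Returns the number of lines written.
    """
    count = 0
    with open(path, "a") as out:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        try:
            for line in proc.stdout:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                out.write(f"{stamp} {line.rstrip()}\n")
                out.flush()  # keep the file current if the node falls over
                count += 1
                if max_lines is not None and count >= max_lines:
                    break
        finally:
            proc.terminate()
            proc.wait()
    return count


# Hypothetical usage, with the sort option mentioned earlier in the thread:
# spool(["riak-admin", "top", "-sort", "msg_q"], "riak_top.log")
```

The timestamps make it easier to line the captured frames up with the
health-check entries in the Riak log afterwards.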


On Tue, Apr 2, 2013 at 7:31 AM, Dave Brady <[email protected]> wrote:
> It happened again today, though I was not available to watch it at the time.
>
> Three nodes each showed riak_kv being stopped for one minute:
>
> 2013-04-02 11:10:57.923 [info] <0.2833.1447>@riak_kv_app:check_kv_health:239 Disabling riak_kv due to large message queues. Offending vnodes: [{319703483166135013357056057156686910549735243776,5798}]
> 2013-04-02 11:11:57.924 [info] <0.3589.1447>@riak_kv_app:check_kv_health:242 Re-enabling riak_kv after successful health check
>
> --
> Dave Brady
>
> ----- Original Message -----
> From: "Dave Brady" <[email protected]>
> To: "Evan Vigil-McClanahan" <[email protected]>
> Cc: [email protected]
> Sent: Monday, April 1, 2013 11:15:47 AM GMT +01:00 Amsterdam / Berlin / Bern 
> / Rome / Stockholm / Vienna
> Subject: Re: Having to raise VM number-of-processes limit
>
> Hi Evan,
>
> Thanks for the suggestions!
>
> I did not think that raising that limit was normal.  Glad to have 
> confirmation.
>
> I'll go through the logs again, and run 'riak-admin top ...' the next time it 
> happens.
>
> --
> Dave Brady
>
> ----- Original Message -----
> From: "Evan Vigil-McClanahan" <[email protected]>
> To: "Dave Brady" <[email protected]>
> Cc: [email protected]
> Sent: Saturday, March 30, 2013 11:03:30 PM GMT +01:00 Amsterdam / Berlin / 
> Bern / Rome / Stockholm / Vienna
> Subject: Re: Having to raise VM number-of-processes limit
>
> Dave,
>
> If you're seeing the process count go that high, it suggests to me
> that something else is wrong.  Typically, even for heavily loaded
> clusters, hundreds of thousands of processes isn't normal.  Is there
> anything else in the logs?
>
> When a node sees this sort of behavior start, what does riak-admin top
> -sort msg_q look like?
>
> On Sat, Mar 30, 2013 at 2:07 PM, Dave Brady <[email protected]> wrote:
>> Hello,
>>
>> I have run into a situation whereby I started seeing:
>>
>> [error] emulator Too many processes
>>
>> when some of our new jobs ran.  These jobs are in Perl using Net::Riak,
>> communicating to the cluster via PBC.  They fire tens of thousands of fetches
>> and stores over the course of about 20 minutes.
>>
>> Our cluster has five nodes with 1.3, using eLevelDB.
>>
>> I have been raising the limit (+P in vm.args) in increments from the default
>> of 32768.  Currently at 524288, and that is still not high enough.
>>
>> Have any of you had to increase this limit?
>>
>> Thanks!
>>
>> _______________________________________________
>> riak-users mailing list
>> [email protected]
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>
