Christian, all,

Not sure what kind of magic happened, but no server has died in the last 2
days... and counting.
We have not changed a single line of code, which is quite odd...
I'm still monitoring everything and hoping (sic!) for a failure soon so we
can fix the problem!

Thanks




--
Got a blog? Make following it simple: https://www.subtome.com/

Julien Genestoux,
http://twitter.com/julien51

+1 (415) 830 6574
+33 (0)9 70 44 76 29


On Tue, May 14, 2013 at 12:31 PM, Julien Genestoux <
[email protected]> wrote:

> Thanks Christian.
> We do indeed use mapreduce but it's a fairly simple function:
> We retrieve a first object whose value is an array of at most 10 ids and
> then we fetch all the values for these 10 ids.
> However, this mapreduce job is quite rare (maybe 10 times a day at most at
> this point...) so I don't think that's our issue.
> I'll try to run the cluster without any call to that to see if that's
> better, but I'd be very surprised. Also, we were doing this already even
> before we allowed for multiple values, and the cluster was stable back then.
> We do not do key listing or anything like that.
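> For reference, the two-step job described above could be sketched as a plain
> payload for Riak's HTTP /mapred endpoint. This is only an illustrative
> sketch: the bucket and id names are made up, and it assumes the values are
> JSON and are mapped with Riak's built-in Riak.mapValuesJson function.

```python
import json

def mapred_payload(bucket, ids):
    # Build a JSON body for Riak's HTTP /mapred endpoint: one
    # [bucket, key] input pair per id (at most 10 here), mapped
    # with the built-in JavaScript function Riak.mapValuesJson.
    return json.dumps({
        "inputs": [[bucket, i] for i in ids],
        "query": [{"map": {"language": "javascript",
                           "name": "Riak.mapValuesJson"}}],
    })

payload = mapred_payload("items", ["id1", "id2"])
```

> The payload would then be POSTed to /mapred with content type
> application/json.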
>
> I'll try looking at the statistics too.
>
> Thanks,
>
>
>
>
> On Tue, May 14, 2013 at 11:50 AM, Christian Dahlqvist <[email protected]
> > wrote:
>
>> Hi Julien,
>>
>> The node appears to have crashed due to an inability to allocate memory.
>> How are you accessing your data? Are you running any key listing or large
>> MapReduce jobs that could use up a lot of memory?
>>
>> In order to ensure that you are efficiently resolving siblings I would
>> recommend you monitor the statistics in Riak (
>> http://docs.basho.com/riak/latest/cookbooks/Statistics-and-Monitoring/).
>> Specifically look at node_get_fsm_objsize_* and node_get_fsm_siblings_*
>> statistics in order to identify objects that are very large or have lots of
>> siblings.
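>> A minimal sketch of how one might act on those statistics, assuming a
>> /stats snapshot already fetched and decoded as a dict; the threshold
>> values are illustrative, not official limits:

```python
def check_stats(stats, objsize_limit=1_000_000, sibling_limit=50):
    # Inspect a snapshot of Riak's /stats output (a dict) and
    # flag the two failure modes discussed here: very large
    # objects and sibling explosions. The *_100 statistics are
    # the maxima observed over the reporting window.
    warnings = []
    if stats.get("node_get_fsm_objsize_100", 0) > objsize_limit:
        warnings.append("largest fetched object exceeds limit")
    if stats.get("node_get_fsm_siblings_100", 0) > sibling_limit:
        warnings.append("some object has too many siblings")
    return warnings
```

>> Polling this periodically on each node should make it clear whether
>> object size or sibling count is growing before a crash.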
>>
>> Best regards,
>>
>> Christian
>>
>>
>>
>> On 13 May 2013, at 16:44, Julien Genestoux <[email protected]>
>> wrote:
>>
>> Christian, All,
>>
>> Bad news: my laptop is completely dead. Good news: I have a new one, and
>> it's now fully operational (backups FTW!).
>>
>> The log files have finally been uploaded:
>> https://www.dropbox.com/s/j7l3lniu0wogu29/riak-died.tar.gz
>>
>> I have attached our config to this mail.
>>
>> The machine is a virtual Xen instance at Linode with 4GB of memory. I
>> know it's probably not the very best setup, but 1) we're on a budget and 2)
>> we assumed that would fit our needs quite well.
>>
>> Just to put things in more detail: initially we did not use allow_mult
>> and things worked fine for a couple of days. As soon as we enabled
>> allow_mult, we were not able to run the cluster for more than 5 hours
>> without seeing failing nodes, which is why I'm convinced we must be doing
>> something wrong. The question is: what?
>>
>> Thanks
>>
>>
>> On Sun, May 12, 2013 at 8:07 PM, Christian Dahlqvist <[email protected]
>> > wrote:
>>
>>> Hi Julien,
>>>
>>> I was not able to access the logs based on the link you provided.
>>>
>>> Could you please attach a copy of your app.config file so we can get a
>>> better understanding of the configuration of your cluster? Also, what is
>>> the specification of the machines in the cluster?
>>>
>>> How much data do you have in the cluster and how are you querying it?
>>>
>>> Best regards,
>>>
>>> Christian
>>>
>>>
>>>
>>> On 12 May 2013, at 19:11, Julien Genestoux <[email protected]>
>>> wrote:
>>>
>>> Hi,
>>>
>>> We are running a cluster of 5 servers, or at least trying to, because
>>> nodes seem to be dying 'randomly' without us knowing why. We don't have
>>> a great Erlang guy aboard, and the error logs are not that verbose.
>>> So I've just tarred up the whole log directory and I was hoping somebody
>>> could give us a clue.
>>> It's there:
>>> https://www.dropbox.com/s/z9ezv0qlxgfhcyq/riak-died.tar.gz (might not be
>>> fully uploaded to Dropbox yet!)
>>>
>>> I've looked at the archives, and some people said their server was dying
>>> because an object was too big to fit in memory. Maybe that's what we're
>>> seeing?
>>>
>>> As one of our buckets is set with allow_mult, I am tempted to think that
>>> some objects' sizes may be exploding. However, we do actually try to
>>> resolve conflicts in our code. Any idea how to confirm, and then debug,
>>> whether we have an issue there?
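>>> For what it's worth, a resolution step for siblings whose values are
>>> lists of ids might look like the sketch below. The union strategy is an
>>> assumption; your application may need different merge semantics, and the
>>> merged value would still have to be written back to collapse the siblings.

```python
def resolve_siblings(siblings):
    # Merge conflicting values, each assumed to be a list of ids,
    # by taking their set union. Writing the merged result back
    # to Riak (with the fetched vclock) collapses the siblings.
    merged = set()
    for value in siblings:
        merged |= set(value)
    return sorted(merged)
```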
>>>
>>>
>>> Thanks a lot for your precious help...
>>>
>>> Julien
>>>
>>>
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> [email protected]
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>>
>> <app.config>
>>
>>
>>
>
