Christian, all,

Not sure what kind of magic happened, but no server has died in the last 2 days... and counting. We have not changed a single line of code, which is quite odd. I'm still monitoring everything and hoping (sic!) for a failure soon so we can finally fix the problem!
Thanks,

--
*Got a blog? Make following it simple: https://www.subtome.com/ *
Julien Genestoux, http://twitter.com/julien51
+1 (415) 830 6574
+33 (0)9 70 44 76 29


On Tue, May 14, 2013 at 12:31 PM, Julien Genestoux <[email protected]> wrote:

> Thanks Christian.
> We do indeed use MapReduce, but it's a fairly simple function: we
> retrieve a first object whose value is an array of at most 10 ids, and
> then we fetch the values for these 10 ids.
> However, this MapReduce job is quite rare (maybe 10 times a day at most
> at this point), so I don't think that's our issue. I'll try to run the
> cluster without any call to it to see if that's better, but I'd be very
> surprised. Also, we were already doing this before we allowed for
> multiple values, and the cluster was stable back then.
> We do not do key listing or anything like that.
>
> I'll try looking at the statistics too.
>
> Thanks,
>
> On Tue, May 14, 2013 at 11:50 AM, Christian Dahlqvist <[email protected]> wrote:
>
>> Hi Julien,
>>
>> The node appears to have crashed due to an inability to allocate
>> memory. How are you accessing your data? Are you running any key
>> listing or large MapReduce jobs that could use up a lot of memory?
>>
>> To ensure that you are resolving siblings efficiently, I would
>> recommend monitoring the statistics in Riak
>> (http://docs.basho.com/riak/latest/cookbooks/Statistics-and-Monitoring/).
>> Specifically, look at the node_get_fsm_objsize_* and
>> node_get_fsm_siblings_* statistics to identify objects that are very
>> large or have lots of siblings.
>>
>> Best regards,
>>
>> Christian
>>
>> On 13 May 2013, at 16:44, Julien Genestoux <[email protected]> wrote:
>>
>> Christian, all,
>>
>> Bad news: my laptop is completely dead. Good news: I have a new one,
>> and it's now fully operational (backups FTW!).
>>
>> The log files have finally been uploaded:
>> https://www.dropbox.com/s/j7l3lniu0wogu29/riak-died.tar.gz
>>
>> I have attached our config to this mail.
>>
>> The machine is a virtual Xen instance at Linode with 4GB of memory. I
>> know it's probably not the very best setup, but 1) we're on a budget,
>> and 2) we assumed it would fit our needs quite well.
>>
>> To put things in more detail: initially we did not use allow_mult, and
>> things worked out fine for a couple of days. As soon as we enabled
>> allow_mult, we were not able to run the cluster for more than 5 hours
>> without seeing nodes fail, which is why I'm convinced we must be doing
>> something wrong. The question is: what?
>>
>> Thanks
>>
>> On Sun, May 12, 2013 at 8:07 PM, Christian Dahlqvist <[email protected]> wrote:
>>
>>> Hi Julien,
>>>
>>> I was not able to access the logs via the link you provided.
>>>
>>> Could you please attach a copy of your app.config file so we can get
>>> a better understanding of the configuration of your cluster? Also,
>>> what is the specification of the machines in the cluster?
>>>
>>> How much data do you have in the cluster, and how are you querying it?
>>>
>>> Best regards,
>>>
>>> Christian
>>>
>>> On 12 May 2013, at 19:11, Julien Genestoux <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> We are running a cluster of 5 servers, or at least trying to, because
>>> nodes seem to be dying 'randomly' without us knowing why. We don't
>>> have a great Erlang guy on board, and the error logs are not that
>>> verbose. So I've just .tgz'd the whole log directory, and I was
>>> hoping somebody could give us a clue.
>>> It's there:
>>> https://www.dropbox.com/s/z9ezv0qlxgfhcyq/riak-died.tar.gz
>>> (might not be fully uploaded to Dropbox yet!)
>>>
>>> I've looked at the archives, and some people said their servers were
>>> dying because some object was too big to allocate in memory. Maybe
>>> that's what we're seeing?
>>>
>>> As one of our buckets is set with allow_mult, I am tempted to think
>>> that some object's size may be exploding. However, we do actually
>>> try to resolve conflicts in our code. Any idea how to confirm, and
>>> then debug, that we have an issue there?
>>>
>>> Thanks a lot for your precious help...
>>>
>>> Julien
>>
>> <app.config>
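For anyone following along: Christian's statistics suggestion can be scripted, since each Riak node exposes its stats as JSON over HTTP on the /stats endpoint. A minimal polling sketch in Python, assuming the default HTTP port (8098); the hostnames are hypothetical:

    # Poll each node's /stats endpoint and print the object-size and
    # sibling percentiles mentioned above. Hostnames are hypothetical;
    # 8098 is Riak's default HTTP port.
    import requests

    NODES = ["riak1.example.com", "riak2.example.com"]
    PREFIXES = ["node_get_fsm_objsize", "node_get_fsm_siblings"]

    for host in NODES:
        stats = requests.get("http://%s:8098/stats" % host).json()
        for prefix in PREFIXES:
            for suffix in ("mean", "median", "95", "99", "100"):
                key = "%s_%s" % (prefix, suffix)
                print("%s %s = %s" % (host, key, stats.get(key)))

A steadily climbing node_get_fsm_siblings_100 or node_get_fsm_objsize_100 would confirm that some object is accumulating siblings (and bytes) faster than they are being resolved.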
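On confirming the allow_mult theory: the usual pattern is to fetch the object, merge all sibling values in the application, and write the merged value back so the siblings actually collapse. A sketch with the official Python client (the riak package, using the 2.x sibling API; older client versions expose siblings differently). The bucket name, key, and set-union merge are assumptions for illustration, not details from this thread:

    # Application-side sibling resolution sketch. Bucket/key names and
    # the set-union merge are hypothetical; adapt to your data model.
    import riak

    client = riak.RiakClient(protocol="pbc", pb_port=8087)
    bucket = client.bucket("entries")  # hypothetical bucket

    obj = bucket.get("some-key")       # hypothetical key
    if len(obj.siblings) > 1:
        # Merge: union of all sibling values (assumes list-valued data).
        merged = set()
        for sibling in obj.siblings:
            merged.update(sibling.data)
        # Collapse to a single sibling and store it; the write carries
        # the causal context from the fetch, so Riak discards the old
        # siblings instead of adding a new one.
        obj.siblings = [obj.siblings[0]]
        obj.data = sorted(merged)
        obj.store()

The failure mode to look for: if conflicts are resolved in memory on read but the merged value is never written back, every concurrent write adds another sibling, the object keeps growing, and an eventual read can exhaust memory, which would match the crashes described above.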
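As for the MapReduce job described above (one index object holding at most 10 ids, then fetching those 10 values), that access pattern does not need MapReduce at all; plain KV gets do the same work with far more predictable memory use on the cluster. A sketch under the same hypothetical names as above:

    # Replace the "1 index object + up to 10 ids" MapReduce with plain
    # KV fetches. Assumes the index object's value is a list of keys.
    import riak

    client = riak.RiakClient(protocol="pbc", pb_port=8087)
    index_bucket = client.bucket("indexes")  # hypothetical
    data_bucket = client.bucket("entries")   # hypothetical

    ids = index_bucket.get("some-index-key").data  # at most 10 ids
    values = [data_bucket.get(i).data for i in ids]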
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
