Christian, all,

Our servers still have not died... but we are seeing another strange behavior: our data store needs a lot more disk space than we expect.

Based on the status command, the average size of our objects (node_get_fsm_objsize_mean) is about 1500 bytes. We have 2 buckets, both with an n_val of 3. We count the values in each bucket with the following MapReduce:

    curl -XPOST http://192.168.134.42:8098/mapred \
      -H 'Content-Type: application/json' \
      -d '{"inputs":"BUCKET","query":[{"reduce":{"language":"erlang","module":"riak_kv_mapreduce","function":"reduce_count_inputs","arg":{"do_prereduce":true}}}],"timeout":100000}'

We get 194556 values for one bucket and 1572661 for the other (both numbers are consistent with what we expected), so if our math is right we need a total of 3 * (194556 + 1572661) * 1500 bytes, or about 7.4 GB of disk.
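For the record, this is the back-of-the-envelope check behind that figure (nothing but the numbers above; the 7.4 is in 1024-based GB):

    $ echo $(( 3 * (194556 + 1572661) * 1500 ))
    7952476500

7952476500 / 1024^3 is roughly 7.4 GB, i.e. about 1.5 GB per node if the data were spread evenly across our 5 nodes.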
Now, though, when I inspect the storage actually occupied on our hard drives, we see something weird. This is the du output on each node:

    riak1:   2802888   /var/lib/riak
    riak2:   4159976   /var/lib/riak
    riak5:   4603312   /var/lib/riak
    riak3:   4915180   /var/lib/riak
    riak4:  37466784   /var/lib/riak

As you can see, not all nodes have the same "size". What's even weirder is that up until a couple of hours ago they were all growing "together", close to what the riak4 node shows. Could this be due to our "delete" policy? It turns out that we delete a lot of items. (Is there a way to get the list of commands sent to a node/cluster?)
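My understanding (which may well be wrong) is that with the Bitcask backend a delete only writes a tombstone, and the space held by deleted values is reclaimed later, when Bitcask merges its data files. If that's right, heavy deleting could leave a lot of dead-but-not-yet-reclaimed bytes on disk, and riak4 may simply not have merged recently. Assuming we are on the default Bitcask backend and the stock package paths (both assumptions on my part), this is how I'm checking whether we override any of the merge settings:

    $ egrep 'merge_window|merge_trigger|dead_bytes|frag_' /etc/riak/app.config

If nothing turns up there, we should be running with the Bitcask defaults.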
Thanks!


On Wed, May 15, 2013 at 11:29 PM, Julien Genestoux <[email protected]> wrote:

> Christian, all,
>
> Not sure what kind of magic happened, but no server has died in the last
> 2 days... and counting.
> We have not changed a single line of code, which is quite odd...
> I'm still monitoring everything and hope (sic!) for a failure soon so we
> can fix the problem!
>
> Thanks
>
> --
> Got a blog? Make following it simple: https://www.subtome.com/
>
> Julien Genestoux,
> http://twitter.com/julien51
>
> +1 (415) 830 6574
> +33 (0)9 70 44 76 29
>
>
> On Tue, May 14, 2013 at 12:31 PM, Julien Genestoux <[email protected]> wrote:
>
>> Thanks Christian.
>> We do indeed use MapReduce, but it's a fairly simple function: we
>> retrieve a first object whose value is an array of at most 10 ids, and
>> then we fetch the values for those 10 ids.
>> However, this MapReduce job is quite rare (maybe 10 times a day at most
>> at this point...), so I don't think that's our issue.
>> I'll try to run the cluster without any calls to it to see if that's
>> better, but I'd be very surprised. Also, we were already doing this even
>> before we allowed for multiple values, and the cluster was stable back
>> then.
>> We do not do key listing or anything like that.
>>
>> I'll try looking at the statistics too.
>>
>> Thanks,
>>
>>
>> On Tue, May 14, 2013 at 11:50 AM, Christian Dahlqvist <[email protected]> wrote:
>>
>>> Hi Julien,
>>>
>>> The node appears to have crashed due to an inability to allocate
>>> memory. How are you accessing your data? Are you running any key
>>> listing or large MapReduce jobs that could use up a lot of memory?
>>>
>>> In order to ensure that you are efficiently resolving siblings, I would
>>> recommend you monitor the statistics in Riak
>>> (http://docs.basho.com/riak/latest/cookbooks/Statistics-and-Monitoring/).
>>> Specifically, look at the node_get_fsm_objsize_* and
>>> node_get_fsm_siblings_* statistics in order to identify objects that
>>> are very large or have lots of siblings.
>>>
>>> Best regards,
>>>
>>> Christian
>>>
>>>
>>> On 13 May 2013, at 16:44, Julien Genestoux <[email protected]> wrote:
>>>
>>> Christian, All,
>>>
>>> Bad news: my laptop is completely dead. Good news: I have a new one,
>>> and it's now fully operational (backups FTW!).
>>>
>>> The log files have finally been uploaded:
>>> https://www.dropbox.com/s/j7l3lniu0wogu29/riak-died.tar.gz
>>>
>>> I have attached our config to this mail.
>>>
>>> The machine is a virtual Xen instance at Linode with 4GB of memory. I
>>> know it's probably not the very best setup, but 1) we're on a budget
>>> and 2) we assumed it would fit our needs quite well.
>>>
>>> To put things in more detail: initially we did not use allow_mult, and
>>> things worked out fine for a couple of days. As soon as we enabled
>>> allow_mult, we were not able to run the cluster for more than 5 hours
>>> without seeing failing nodes, which is why I'm convinced we must be
>>> doing something wrong. The question is: what?
>>>
>>> Thanks
>>>
>>>
>>> On Sun, May 12, 2013 at 8:07 PM, Christian Dahlqvist <[email protected]> wrote:
>>>
>>>> Hi Julien,
>>>>
>>>> I was not able to access the logs via the link you provided.
>>>>
>>>> Could you please attach a copy of your app.config file so we can get
>>>> a better understanding of your cluster's configuration? Also, what is
>>>> the specification of the machines in the cluster?
>>>>
>>>> How much data do you have in the cluster, and how are you querying it?
>>>>
>>>> Best regards,
>>>>
>>>> Christian
>>>>
>>>>
>>>> On 12 May 2013, at 19:11, Julien Genestoux <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We are running a cluster of 5 servers, or at least trying to, because
>>>> nodes seem to be dying 'randomly' without us knowing why. We don't
>>>> have a great Erlang guy aboard, and the error logs are not that
>>>> verbose.
>>>> So I've .tgz'd the whole log directory, and I was hoping somebody
>>>> could give us a clue. It's here:
>>>> https://www.dropbox.com/s/z9ezv0qlxgfhcyq/riak-died.tar.gz (might not
>>>> be fully uploaded to Dropbox yet!)
>>>>
>>>> I've looked at the archive, and some people said their server was
>>>> dying because some object was too big to be allocated in memory.
>>>> Maybe that's what we're seeing?
>>>>
>>>> As one of our buckets is set with allow_mult, I am tempted to think
>>>> that some object's size may be exploding. However, we do actually try
>>>> to resolve conflicts in our code. Any idea how to confirm, and then
>>>> debug, whether we have an issue there?
>>>>
>>>> Thanks a lot for your precious help...
>>>>
>>>> Julien
>>>>
>>>>
>>> <app.config>
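P.S. Christian, following your earlier advice to watch the node_get_fsm_objsize_* and node_get_fsm_siblings_* statistics, this is the quick per-node check I run against the HTTP /stats endpoint (the riak1..riak5 hostnames and the default port 8098 are just how our cluster happens to be set up, and the egrep assumes the JSON comes back without spaces around the colons):

    for h in riak1 riak2 riak3 riak4 riak5; do
      echo "== $h =="
      curl -s -H 'Accept: application/json' "http://$h:8098/stats" \
        | egrep -o '"node_get_fsm_(objsize|siblings)_(mean|99)":[0-9.]+'
    done

Nothing fancy, but it makes it easy to spot one node drifting away from the others.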
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
