Christian, all,

Our servers still have not died... but we are seeing another strange behavior: our data store needs a lot more disk space than we expect.

Based on the status command, the average size of our objects (node_get_fsm_objsize_mean) is about 1500 bytes. We have 2 buckets, both with an n_val of 3. We count the values in each bucket with the following MapReduce:

    curl -XPOST http://192.168.134.42:8098/mapred \
      -H 'Content-Type: application/json' \
      -d '{"inputs":"BUCKET","query":[{"reduce":{"language":"erlang","module":"riak_kv_mapreduce","function":"reduce_count_inputs","arg":{"do_prereduce":true}}}],"timeout":100000}'

We get 194556 values for one bucket and 1572661 for the other (both numbers are consistent with what we expected), so if our math is right we need a total of 3 * (194556 + 1572661) * 1500 bytes, or about 7.4 GB of disk.
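For the record, this is the back-of-the-envelope check behind that figure (nothing but the numbers above; the 7.4 is in 1024-based GB):

    $ echo $(( 3 * (194556 + 1572661) * 1500 ))
    7952476500

7952476500 / 1024^3 is roughly 7.4 GB, i.e. about 1.5 GB per node if the data were spread evenly across our 5 nodes.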
Now, though, when I inspect the storage actually occupied on our hard drives, we see something weird. This is the du output on each node:

    riak1:   2802888   /var/lib/riak
    riak2:   4159976   /var/lib/riak
    riak5:   4603312   /var/lib/riak
    riak3:   4915180   /var/lib/riak
    riak4:  37466784   /var/lib/riak

As you can see, not all nodes have the same "size". What's even weirder is that up until a couple of hours ago they were all growing "together", close to what the riak4 node shows. Could this be due to our "delete" policy? It turns out that we delete a lot of items. (Is there a way to get the list of commands sent to a node/cluster?)
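My understanding (which may well be wrong) is that with the Bitcask backend a delete only writes a tombstone, and the space held by deleted values is reclaimed later, when Bitcask merges its data files. If that's right, heavy deleting could leave a lot of dead-but-not-yet-reclaimed bytes on disk, and riak4 may simply not have merged recently. Assuming we are on the default Bitcask backend and the stock package paths (both assumptions on my part), this is how I'm checking whether we override any of the merge settings:

    $ egrep 'merge_window|merge_trigger|dead_bytes|frag_' /etc/riak/app.config

If nothing turns up there, we should be running with the Bitcask defaults.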
Thanks!


On Wed, May 15, 2013 at 11:29 PM, Julien Genestoux <[email protected]> wrote:

> Christian, all,
>
> Not sure what kind of magic happened, but no server has died in the last
> 2 days... and counting.
> We have not changed a single line of code, which is quite odd...
> I'm still monitoring everything and hope (sic!) for a failure soon so we
> can fix the problem!
>
> Thanks
>
> --
> Got a blog? Make following it simple: https://www.subtome.com/
>
> Julien Genestoux,
> http://twitter.com/julien51
>
> +1 (415) 830 6574
> +33 (0)9 70 44 76 29
>
>
> On Tue, May 14, 2013 at 12:31 PM, Julien Genestoux <[email protected]> wrote:
>
>> Thanks Christian.
>> We do indeed use MapReduce, but it's a fairly simple function: we
>> retrieve a first object whose value is an array of at most 10 ids, and
>> then we fetch the values for those 10 ids.
>> However, this MapReduce job is quite rare (maybe 10 times a day at most
>> at this point...), so I don't think that's our issue.
>> I'll try to run the cluster without any calls to it to see if that's
>> better, but I'd be very surprised. Also, we were already doing this even
>> before we allowed for multiple values, and the cluster was stable back
>> then.
>> We do not do key listing or anything like that.
>>
>> I'll try looking at the statistics too.
>>
>> Thanks,
>>
>>
>> On Tue, May 14, 2013 at 11:50 AM, Christian Dahlqvist <[email protected]> wrote:
>>
>>> Hi Julien,
>>>
>>> The node appears to have crashed due to an inability to allocate
>>> memory. How are you accessing your data? Are you running any key
>>> listing or large MapReduce jobs that could use up a lot of memory?
>>>
>>> In order to ensure that you are efficiently resolving siblings, I would
>>> recommend you monitor the statistics in Riak
>>> (http://docs.basho.com/riak/latest/cookbooks/Statistics-and-Monitoring/).
>>> Specifically, look at the node_get_fsm_objsize_* and
>>> node_get_fsm_siblings_* statistics in order to identify objects that
>>> are very large or have lots of siblings.
>>>
>>> Best regards,
>>>
>>> Christian
>>>
>>>
>>> On 13 May 2013, at 16:44, Julien Genestoux <[email protected]> wrote:
>>>
>>> Christian, All,
>>>
>>> Bad news: my laptop is completely dead. Good news: I have a new one,
>>> and it's now fully operational (backups FTW!).
>>>
>>> The log files have finally been uploaded:
>>> https://www.dropbox.com/s/j7l3lniu0wogu29/riak-died.tar.gz
>>>
>>> I have attached our config to this mail.
>>>
>>> The machine is a virtual Xen instance at Linode with 4GB of memory. I
>>> know it's probably not the very best setup, but 1) we're on a budget
>>> and 2) we assumed it would fit our needs quite well.
>>>
>>> To put things in more detail: initially we did not use allow_mult, and
>>> things worked out fine for a couple of days. As soon as we enabled
>>> allow_mult, we were not able to run the cluster for more than 5 hours
>>> without seeing failing nodes, which is why I'm convinced we must be
>>> doing something wrong. The question is: what?
>>>
>>> Thanks
>>>
>>>
>>> On Sun, May 12, 2013 at 8:07 PM, Christian Dahlqvist <[email protected]> wrote:
>>>
>>>> Hi Julien,
>>>>
>>>> I was not able to access the logs via the link you provided.
>>>>
>>>> Could you please attach a copy of your app.config file so we can get
>>>> a better understanding of your cluster's configuration? Also, what is
>>>> the specification of the machines in the cluster?
>>>>
>>>> How much data do you have in the cluster, and how are you querying it?
>>>>
>>>> Best regards,
>>>>
>>>> Christian
>>>>
>>>>
>>>> On 12 May 2013, at 19:11, Julien Genestoux <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We are running a cluster of 5 servers, or at least trying to, because
>>>> nodes seem to be dying 'randomly' without us knowing why. We don't
>>>> have a great Erlang guy aboard, and the error logs are not that
>>>> verbose.
>>>> So I've .tgz'd the whole log directory, and I was hoping somebody
>>>> could give us a clue. It's here:
>>>> https://www.dropbox.com/s/z9ezv0qlxgfhcyq/riak-died.tar.gz (might not
>>>> be fully uploaded to Dropbox yet!)
>>>>
>>>> I've looked at the archive, and some people said their server was
>>>> dying because some object was too big to be allocated in memory.
>>>> Maybe that's what we're seeing?
>>>>
>>>> As one of our buckets is set with allow_mult, I am tempted to think
>>>> that some object's size may be exploding. However, we do actually try
>>>> to resolve conflicts in our code. Any idea how to confirm, and then
>>>> debug, whether we have an issue there?
>>>>
>>>> Thanks a lot for your precious help...
>>>>
>>>> Julien
>>>>
>>>>
>>> <app.config>
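P.S. Christian, following your earlier advice to watch the node_get_fsm_objsize_* and node_get_fsm_siblings_* statistics, this is the quick per-node check I run against the HTTP /stats endpoint (the riak1..riak5 hostnames and the default port 8098 are just how our cluster happens to be set up, and the egrep assumes the JSON comes back without spaces around the colons):

    for h in riak1 riak2 riak3 riak4 riak5; do
      echo "== $h =="
      curl -s -H 'Accept: application/json' "http://$h:8098/stats" \
        | egrep -o '"node_get_fsm_(objsize|siblings)_(mean|99)":[0-9.]+'
    done

Nothing fancy, but it makes it easy to spot one node drifting away from the others.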
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
