Re: tail repair issue (1.4.20)

Denis Samoylov Mon, 07 Jul 2014 14:53:33 -0700

It does not use the whole memory :). Also, a small misprint: it is 25 for this 
pool (26 is a different pool). I've sent you the stats.


On Wednesday, July 2, 2014 8:27:09 PM UTC-7, Zhiwei Chan wrote:
>
>   Hi,
>    with hash power 26 and slab 13, that means (2**26)*1.5*1488=142G of 
> memory is needed. Could you please post the stats info to this thread, or 
> send me a copy too? Also, is that "tons of 'allocation failure'" from 
> the system log or from the "outofmemory" statistic in memcached? Finally, I 
> think generating a core file with debug info would be very helpful.
>   Thanks.
>
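For reference, the arithmetic behind that estimate can be sketched in Python. The 1.5 factor and the 1488-byte figure are taken as given from the message above; reading 1488 as the per-item size for slab 13 is an assumption, and the function name is made up for illustration.

```python
def estimated_bytes(hash_power, items_per_bucket, item_size):
    """Rough estimate: hash buckets * items per bucket * bytes per item."""
    buckets = 2 ** hash_power
    return int(buckets * items_per_bucket * item_size)


# Compare the two hash power values mentioned in this thread.
for power in (25, 26):
    total = estimated_bytes(power, 1.5, 1488)
    print(f"hashpower {power}: {total} bytes, {total / 2**30:.1f} GiB")
```

With hash power 26 this comes out a little under 140 GiB, in the same ballpark as the quoted 142G; with hash power 25 the same formula gives half of that.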
>
> 2014-07-03 6:43 GMT+08:00 Denis Samoylov <[email protected]>:
>
>> >1)  OOM's on slab 13, but it recovered on its own? This is under version 
>> 1.4.20 and you did *not* enable tail repairs? 
>> correct
>>
>>
>> >2) Can you share (with me at least) the full stats/stats items/stats 
>> slabs output from one of the affected servers running 1.4.20? 
>> sent you _current_ stats from the server that had the OOM a couple of days 
>> ago and is still running (now with no issues).
>>
>>
>> >3) Can you confirm that 1.4.20 isn't *crashing*, but is 
>> actually exhibiting write failures? 
>> correct
>>
>> We will enable saving stderr to a log; maybe this will show something. If 
>> you have any other ideas, let me know.
>>
>> -denis
>>
>>
>>
>> On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
>>>
>>> Cool. That is disappointing. 
>>>
>>> Can you clarify a few things for me: 
>>>
>>> 1) You're saying that you were getting OOM's on slab 13, but it recovered 
>>> on its own? This is under version 1.4.20 and you did *not* enable tail 
>>> repairs? 
>>>
>>> 2) Can you share (with me at least) the full stats/stats items/stats slabs 
>>> output from one of the affected servers running 1.4.20? 
>>>
>>> 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually 
>>> exhibiting write failures? 
>>>
>>> If it's not a crash, and your hash power level isn't expanding, I don't 
>>> think it's related to the other bug. 
>>>
>>> thanks! 
>>>
>>> On Wed, 2 Jul 2014, Denis Samoylov wrote: 
>>>
>>> > Dormando, sure, we will add the option to presize the hash table (as I 
>>> > see it, nn should be 26). 
>>> > One question: as I see in the logs for these servers, there was no 
>>> > change in hash_power_level before the incident (it would be hard to say 
>>> > for the crashed ones, but .20 just had outofmemory and I have solid 
>>> > stats). Doesn't this contradict the idea of the cause? The server had 
>>> > hash_power_level = 26 for days before and still has 26 days after. Just 
>>> > for three hours, every set for slab 13 failed. We did not reboot/flush 
>>> > the server and it continues to work without problems. What do you think? 
>>> > 
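One way to check that only a single slab class is failing, as described above, is to compare the per-class outofmemory counters between two "stats items" snapshots. The following is a minimal Python sketch, assuming the usual "STAT items:<class>:outofmemory <n>" line format; the function names and the sample numbers are made up for illustration and are not taken from the affected servers.

```python
def oom_by_class(raw):
    """Map slab class id -> outofmemory counter from 'stats items' text."""
    counts = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "STAT":
            continue  # skips END and anything that is not a STAT line
        fields = parts[1].split(":")
        if len(fields) == 3 and fields[0] == "items" and fields[2] == "outofmemory":
            counts[int(fields[1])] = int(parts[2])
    return counts


def growing_classes(before, after):
    """Return {class id: increase} for classes whose counter went up."""
    return {
        cls: after[cls] - before.get(cls, 0)
        for cls in after
        if after[cls] > before.get(cls, 0)
    }


# Hypothetical snapshots taken some time apart (made-up values).
snapshot_before = "STAT items:12:outofmemory 0\r\nSTAT items:13:outofmemory 10\r\nEND\r\n"
snapshot_after = "STAT items:12:outofmemory 0\r\nSTAT items:13:outofmemory 950\r\nEND\r\n"

increases = growing_classes(oom_by_class(snapshot_before), oom_by_class(snapshot_after))
print(increases)
```

Only classes whose counter grew between the two snapshots are reported, so a long-running server with old, stable counters does not produce noise.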
>>> > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote: 
>>> >       Hey, 
>>> > 
>>> >       Can you presize the hash table (-o hashpower=nn) to be large 
>>> >       enough on those servers such that hash expansion won't happen at 
>>> >       runtime? You can see what hashpower is on a long running server 
>>> >       via stats to know what to set the value to. 
>>> > 
>>> >       If that helps, we might still have a bug in hash expansion. I see 
>>> >       someone finally reproduced a possible issue there under .20. 
>>> >       .17/.19 fix other causes of the problem pretty thoroughly though. 
>>> > 
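The suggestion above (read the current hash power from stats, then start the server with -o hashpower=nn) can be sketched in Python as follows. This is only an illustration: the helper names are hypothetical, the host and port defaults are assumptions, and the sample reply uses made-up values so that the sketch runs without a server.

```python
import socket


def parse_stats(raw):
    """Turn 'STAT name value' lines into a dict of strings."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host="127.0.0.1", port=11211, command="stats"):
    """Send one stats command over the text protocol and return the raw reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(command.encode("ascii") + b"\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")


def hashpower_option(stats):
    """Build the -o hashpower=nn option from the observed hash_power_level."""
    return "-o hashpower=" + stats["hash_power_level"]


# Sample reply text (made-up values); use fetch_stats() against a real server.
sample = "STAT pid 1234\r\nSTAT hash_power_level 26\r\nEND\r\n"
print(hashpower_option(parse_stats(sample)))
```

Against a live server, parse_stats(fetch_stats()) would replace the sample text; the value should be taken from a server that has been running long enough for the table to reach its steady size.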
>>> >       On Tue, 1 Jul 2014, Denis Samoylov wrote: 
>>> > 
>>> >       > Hi, 
>>> >       > We had sporadic memory corruption due to tail repair in pre-.20 
>>> >       > versions, so we updated some of our servers to .20. This Monday 
>>> >       > we observed several crashes in the .15 version and tons of 
>>> >       > "allocation failure" in the .20 version. This is expected, as 
>>> >       > .20 just disables "tail repair", but it seems the problem is 
>>> >       > still there. What is interesting: 
>>> >       > 1) there is no visible change in traffic, and usually only one 
>>> >       > slab is affected. 
>>> >       > 2) this always happens with several but not all servers :) 
>>> >       > 
>>> >       > Is there any way to catch this and help with debugging? I have 
>>> >       > all slab and item stats for the time around the incident for the 
>>> >       > .15 and .20 versions. .15 is clearly memory corruption: gdb shows 
>>> >       > that the hash function returned 0 (line 115 uint32_t hv = 
>>> >       > hash(ITEM_key(search), search->nkey, 0);). 
>>> >       > 
>>> >       > So we seem to be hitting this comment: 
>>> >       >             /* Old rare bug could cause a refcount leak. We haven't seen 
>>> >       >              * it in years, but we leave this code in to prevent failures 
>>> >       >              * just in case */ 
>>> >       > 
>>> >       > :) 
>>> >       > 
>>> >       > Thank you, 
>>> >       > Denis 
>>> >       > 
>>> >       > -- 
>>> >       > 
>>> >       > --- 
>>> >       > You received this message because you are subscribed to the 
>>> Google Groups "memcached" group. 
>>> >       > To unsubscribe from this group and stop receiving emails from 
>>> it, send an email to [email protected]. 
>>> >       > For more options, visit https://groups.google.com/d/optout. 
>>> >       > 
>>> >       > 
>>> > 
>>> >
>>
>
>
