Dormando, sure, we will add the option to presize the hash table (as I see it, nn should be 26).
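For reference, roughly what we plan to do (the host name and port below are just placeholders, and 26 is taken from stats on our long-running servers):

    # restart memcached with the hash table pre-sized so expansion never runs at runtime
    memcached -o hashpower=26 <existing options>

    # verify on the running server; should print something like "STAT hash_power_level 26"
    printf 'stats\r\nquit\r\n' | nc cache01 11211 | grep hash_power_level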
One question: as I see it in the logs for these servers, there was no change in hash_power_level before the incident (it is hard to tell for the crashed servers, but the .20 one only hit out-of-memory errors and I have solid stats for it). Doesn't this contradict the idea that hash expansion is the cause? The server had hash_power_level = 26 for days before the incident and still has 26 days after. For just three hours every set to slab 13 failed. We did not reboot/flush the server and it continues to work without problems. What do you think?

On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
>
> Hey,
>
> Can you presize the hash table? (-o hashpower=nn) to be large enough on
> those servers such that hash expansion won't happen at runtime? You can
> see what hashpower is on a long running server via stats to know what to
> set the value to.
>
> If that helps, we might still have a bug in hash expansion. I see someone
> finally reproduced a possible issue there under .20. .17/.19 fix other
> causes of the problem pretty thoroughly though.
>
> On Tue, 1 Jul 2014, Denis Samoylov wrote:
> >
> > Hi,
> >
> > We had sporadic memory corruption due to tail repair in pre-.20 versions,
> > so we updated some of our servers to .20. This Monday we observed several
> > crashes on .15 and tons of "allocation failures" on .20. This is expected,
> > as .20 just disables tail repair, but it seems the problem is still there.
> > What is interesting:
> >
> > 1) there is no visible change in traffic and usually only one slab is
> > affected.
> > 2) this always happens with several but not all servers :)
> >
> > Is there any way to catch this and help with debugging? I have all slab
> > and item stats for the time around the incident for both the .15 and .20
> > versions. .15 is clearly memory corruption: gdb shows that the hash
> > function returned 0 (line 115: uint32_t hv = hash(ITEM_key(search), search->nkey, 0);).
> >
> > So it seems we are hitting this comment:
> >
> > /* Old rare bug could cause a refcount leak. We haven't seen
> >  * it in years, but we leave this code in to prevent failures
> >  * just in case */
> >
> > :)
> >
> > Thank you,
> > Denis
