> 1) OOM's on slab 13, but it recovered on its own? This is under version
> 1.4.20 and you did *not* enable tail repairs?

Correct.

> 2) Can you share (with me at least) the full stats/stats items/stats slabs
> output from one of the affected servers running 1.4.20?

Sent you _current_ stats from the server that had the OOM a couple of days
ago and is still running (now with no issues).

> 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
> exhibiting write failures?

Correct. We will enable saving stderr to a log; maybe that will show
something. If you have any other ideas, let me know.

-denis

On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
>
> Cool. That is disappointing.
>
> Can you clarify a few things for me:
>
> 1) You're saying that you were getting OOM's on slab 13, but it recovered
> on its own? This is under version 1.4.20 and you did *not* enable tail
> repairs?
>
> 2) Can you share (with me at least) the full stats/stats items/stats slabs
> output from one of the affected servers running 1.4.20?
>
> 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
> exhibiting write failures?
>
> If it's not a crash, and your hash power level isn't expanding, I don't
> think it's related to the other bug.
>
> thanks!
>
> On Wed, 2 Jul 2014, Denis Samoylov wrote:
>
> > Dormando, sure, we will add an option to preset the hashtable (as I see,
> > nn should be 26). One question: as I see in the logs for these servers,
> > there is no change in hash_power_level before the incident (it would be
> > hard to say for the crashed .15 servers, but .20 just had out-of-memory
> > errors and I have solid stats). Doesn't this contradict the idea of the
> > cause? The server had hash_power_level = 26 for days before and still
> > has 26 days after. Just for three hours, every set for slab 13 failed.
> > We did not reboot/flush the server and it continues to work without
> > problem. What do you think?
> >
> > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
> >
> > Hey,
> >
> > Can you presize the hash table (-o hashpower=nn) to be large enough on
> > those servers such that hash expansion won't happen at runtime?
> > You can see what hashpower is on a long-running server via stats to
> > know what to set the value to.
> >
> > If that helps, we might still have a bug in hash expansion. I see
> > someone finally reproduced a possible issue there under .20. .17/.19
> > fix other causes of the problem pretty thoroughly though.
> >
> > On Tue, 1 Jul 2014, Denis Samoylov wrote:
> >
> > > Hi,
> > > We had sporadic memory corruption due to tail repair in pre-.20
> > > versions, so we updated some of our servers to .20. This Monday we
> > > observed several crashes in the .15 version and tons of "allocation
> > > failure" errors in the .20 version. This is expected, as .20 just
> > > disables "tail repair", but it seems the problem is still there.
> > > What is interesting:
> > > 1) there is no visible change in traffic, and usually only one slab
> > > is affected.
> > > 2) this always happens with several, but not all, servers :)
> > >
> > > Is there any way to catch this and help with debugging? I have all
> > > slab and item stats for the time around the incident for both .15
> > > and .20. .15 is clearly memory corruption: gdb shows that the hash
> > > function returned 0 (line 115: uint32_t hv = hash(ITEM_key(search),
> > > search->nkey, 0);).
> > >
> > > So it seems we are hitting this comment:
> > >
> > > /* Old rare bug could cause a refcount leak. We haven't seen
> > >  * it in years, but we leave this code in to prevent failures
> > >  * just in case */
> > >
> > > :)
> > >
> > > Thank you,
> > > Denis
--

---
You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
