> 1) OOM's on slab 13, but it recovered on its own? This is under version
> 1.4.20 and you did *not* enable tail repairs?

Correct.

> 2) Can you share (with me at least) the full stats/stats items/stats slabs
> output from one of the affected servers running 1.4.20?

Sent you _current_ stats from the server that had the OOM a couple of days
ago and is still running (now with no issues).

> 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
> exhibiting write failures?

Correct. We will enable saving stderr to a log; maybe that will show
something. If you have any other ideas, let me know.

-denis

On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
>
> Cool. That is disappointing.
>
> Can you clarify a few things for me:
>
> 1) You're saying that you were getting OOM's on slab 13, but it recovered
> on its own? This is under version 1.4.20 and you did *not* enable tail
> repairs?
>
> 2) Can you share (with me at least) the full stats/stats items/stats slabs
> output from one of the affected servers running 1.4.20?
>
> 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
> exhibiting write failures?
>
> If it's not a crash, and your hash power level isn't expanding, I don't
> think it's related to the other bug.
>
> thanks!
>
> On Wed, 2 Jul 2014, Denis Samoylov wrote:
>
> > Dormando, sure, we will add an option to preset the hashtable (as I see,
> > nn should be 26). One question: as I see in the logs for these servers,
> > there is no change in hash_power_level before the incident (it would be
> > hard to say for the crashed .15 servers, but .20 just had out-of-memory
> > errors and I have solid stats). Doesn't this contradict the idea of the
> > cause? The server had hash_power_level = 26 for days before and still
> > has 26 days after. Just for three hours, every set for slab 13 failed.
> > We did not reboot/flush the server and it continues to work without
> > problem. What do you think?
> >
> > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
> >
> > Hey,
> >
> > Can you presize the hash table (-o hashpower=nn) to be large enough on
> > those servers such that hash expansion won't happen at runtime?
> > You can see what hashpower is on a long-running server via stats to
> > know what to set the value to.
> >
> > If that helps, we might still have a bug in hash expansion. I see
> > someone finally reproduced a possible issue there under .20. .17/.19
> > fix other causes of the problem pretty thoroughly though.
> >
> > On Tue, 1 Jul 2014, Denis Samoylov wrote:
> >
> > > Hi,
> > > We had sporadic memory corruption due to tail repair in pre-.20
> > > versions, so we updated some of our servers to .20. This Monday we
> > > observed several crashes in the .15 version and tons of "allocation
> > > failure" errors in the .20 version. This is expected, as .20 just
> > > disables "tail repair", but it seems the problem is still there.
> > > What is interesting:
> > > 1) there is no visible change in traffic, and usually only one slab
> > > is affected.
> > > 2) this always happens with several, but not all, servers :)
> > >
> > > Is there any way to catch this and help with debugging? I have all
> > > slab and item stats for the time around the incident for both .15
> > > and .20. .15 is clearly memory corruption: gdb shows that the hash
> > > function returned 0 (line 115: uint32_t hv = hash(ITEM_key(search),
> > > search->nkey, 0);).
> > >
> > > So it seems we are hitting this comment:
> > >
> > > /* Old rare bug could cause a refcount leak. We haven't seen
> > >  * it in years, but we leave this code in to prevent failures
> > >  * just in case */
> > >
> > > :)
> > >
> > > Thank you,
> > > Denis
--

---
You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
