It does not use the whole memory :). Also, a small correction: the hash power is 25 for this pool (26 is a different pool). I've sent you the stats.
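For reference, here is a rough sketch of the arithmetic behind the estimate in the quoted message, evaluated for both hash power 25 and 26. The function name and parameter names are only illustrative. It assumes, as the quoted formula appears to, that 1.5 is the number of items per hash bucket and that 1488 bytes is the chunk size of slab 13; neither is confirmed by the stats in this thread.

```python
def estimated_bytes(hash_power, items_per_bucket=1.5, chunk_size=1488):
    """Upper-bound memory estimate: buckets * items per bucket * chunk size.

    items_per_bucket and chunk_size are assumptions taken from the
    formula in the quoted message, not measured values.
    """
    return int((2 ** hash_power) * items_per_bucket * chunk_size)


for hp in (25, 26):
    size = estimated_bytes(hp)
    print(f"hashpower={hp}: {size} bytes ({size / 2**30:.2f} GiB)")
```

This is only an upper bound under those assumptions: it treats every item in the table as if it occupied a slab 13 chunk, which is not the case when items are spread over several slab classes.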

On Wednesday, July 2, 2014 8:27:09 PM UTC-7, Zhiwei Chan wrote:
>
> Hi,
> with the hash power 26, slab 13, that means (2**26)*1.5*1488=142G
> memory is needed. Could you please put the stats info to this thread or
> send a copy for me too? And, is that " tons of 'allocation failure' "
> the system log or the "outofmemory" statistic in memcached? At last, i
> think generate a core file withe debug-info is great helpful.
> Thanks.
>
>
> 2014-07-03 6:43 GMT+08:00 Denis Samoylov <[email protected]>:
>
>> >1) OOM's on slab 13, but it recovered on its own? This is under version
>> 1.4.20 and you did *not* enable tail repairs?
>> correct
>>
>>
>> >2) Can you share (with me at least) the full stats/stats items/stats
>> slabs output from one of the affected servers running 1.4.20?
>> sent you _current_ stats from the server that had OOM couple days ago and
>> still running (now with no issues).
>>
>>
>> >3) Can you confirm that 1.4.20 isn't *crashing*, but is
>> actually exhibiting write failures?
>> correct
>>
>> we will enable saving stderr to log. may be this can show something. If
>> you have any other ideas - let me know.
>>
>> -denis
>>
>>
>>
>> On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
>>>
>>> Cool. That is disappointing.
>>>
>>> Can you clarify a few things for me:
>>>
>>> 1) You're saying that you were getting OOM's on slab 13, but it recovered
>>> on its own? This is under version 1.4.20 and you did *not* enable tail
>>> repairs?
>>>
>>> 2) Can you share (with me at least) the full stats/stats items/stats slabs
>>> output from one of the affected servers running 1.4.20?
>>>
>>> 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
>>> exhibiting write failures?
>>>
>>> If it's not a crash, and your hash power level isn't expanding, I don't
>>> think it's related to the other bug.
>>>
>>> thanks!
>>>
>>> On Wed, 2 Jul 2014, Denis Samoylov wrote:
>>>
>>> > Dormando, sure, we will add option to preset hashtable. (as i see nn
>>> > should be 26).
>>> > One question: as i see in logs for the servers there is no change for
>>> > hash_power_level before incident (it would be hard to say for crushed
>>> > but .20 just had outofmemory and i have solid stats). Does not this
>>> > contradict the idea of cause? Server had hash_power_level = 26 for days
>>> > before and still has 26 days after. Just for three hours every set for
>>> > slab 13 failed. We did not reboot/flush server and it continues to work
>>> > without problem. What do you think?
>>> >
>>> > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
>>> >       Hey,
>>> >
>>> >       Can you presize the hash table? (-o hashpower=nn) to be large enough on
>>> >       those servers such that hash expansion won't happen at runtime? You can
>>> >       see what hashpower is on a long running server via stats to know what to
>>> >       set the value to.
>>> >
>>> >       If that helps, we might still have a bug in hash expansion. I see someone
>>> >       finally reproduced a possible issue there under .20. .17/.19 fix other
>>> >       causes of the problem pretty thoroughly though.
>>> >
>>> >       On Tue, 1 Jul 2014, Denis Samoylov wrote:
>>> >
>>> > > Hi,
>>> > > We had sporadic memory corruption due tail repair in pre .20 version.
>>> > > So we updated some our servers to .20. This Monday we observed several
>>> > > crushes in .15 version and tons of "allocation failure" in .20 version.
>>> > > This is expected as .20 just disables "tail repair" but it seems the
>>> > > problem is still there. What is interesting:
>>> > > 1) there is no visible change in traffic and only one slab is affected usually.
>>> > > 2) this always happens with several but not all servers :)
>>> > >
>>> > > Is there any way to catch this and help with debug? I have all slab and
>>> > > item stats for the time around incident for .15 and .20 version. .15 is
>>> > > clearly memory corruption: gdb shows that hash function returned 0
>>> > > (line 115 uint32_t hv = hash(ITEM_key(search), search->nkey, 0);).
>>> > >
>>> > > so we seems hitting this comment:
>>> > >             /* Old rare bug could cause a refcount leak. We haven't seen
>>> > >              * it in years, but we leave this code in to prevent failures
>>> > >              * just in case */
>>> > >
>>> > > :)
>>> > >
>>> > > Thank you,
>>> > > Denis
>>> > >
>>> > > --
>>> > >
>>> > > ---
>>> > > You received this message because you are subscribed to the Google Groups "memcached" group.
>>> > > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>> > > For more options, visit https://groups.google.com/d/optout.
