Hi, with hash power 26 and slab 13, that means roughly (2**26) * 1.5 * 1488 bytes, i.e. about 140G of memory would be needed. Could you please post the stats info to this thread, or send me a copy as well? Also, is that "tons of 'allocation failure'" from the system log, or from the "outofmemory" statistic in memcached? Finally, I think generating a core file with debug info would be very helpful. Thanks.
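[Editor's note: a quick sanity check of the arithmetic above, using the 1.5 factor and the 1488-byte slab-13 chunk size taken from the message:]

```python
# Worst-case memory estimate from the thread:
# 2**26 hash buckets * 1.5 items per bucket * 1488 bytes (slab 13 chunk size)
hashpower = 26
items = (2 ** hashpower) * 1.5          # ~100.7 million items
total_bytes = items * 1488              # bytes of item storage
print(round(total_bytes / 2 ** 30, 1))  # -> 139.5 (GiB)
```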
2014-07-03 6:43 GMT+08:00 Denis Samoylov <[email protected]>:

> > 1) OOM's on slab 13, but it recovered on its own? This is under version
> > 1.4.20 and you did *not* enable tail repairs?
> correct
>
> > 2) Can you share (with me at least) the full stats/stats items/stats
> > slabs output from one of the affected servers running 1.4.20?
> sent you the _current_ stats from the server that had an OOM a couple of
> days ago and is still running (now with no issues).
>
> > 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
> > exhibiting write failures?
> correct
>
> we will enable saving stderr to a log; maybe this can show something. If
> you have any other ideas - let me know.
>
> -denis
>
> On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
>>
>> Cool. That is disappointing.
>>
>> Can you clarify a few things for me:
>>
>> 1) You're saying that you were getting OOM's on slab 13, but it
>> recovered on its own? This is under version 1.4.20 and you did *not*
>> enable tail repairs?
>>
>> 2) Can you share (with me at least) the full stats/stats items/stats
>> slabs output from one of the affected servers running 1.4.20?
>>
>> 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
>> exhibiting write failures?
>>
>> If it's not a crash, and your hash power level isn't expanding, I don't
>> think it's related to the other bug.
>>
>> thanks!
>>
>> On Wed, 2 Jul 2014, Denis Samoylov wrote:
>>
>> > Dormando, sure, we will add an option to preset the hash table (as I
>> > see, nn should be 26).
>> > One question: as I see in the logs for these servers, there was no
>> > change in hash_power_level before the incident (it would be hard to
>> > say for the crashed servers, but .20 just had outofmemory errors and
>> > I have solid stats). Doesn't this contradict the idea of the cause?
>> > The server had hash_power_level = 26 for days before, and still has
>> > 26 days after. Just for three hours, every set for slab 13 failed. We
>> > did not reboot/flush the server, and it continues to work without
>> > problem. What do you think?
>> >
>> > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
>> > Hey,
>> >
>> > Can you presize the hash table (-o hashpower=nn) to be large enough
>> > on those servers such that hash expansion won't happen at runtime?
>> > You can see what hashpower is on a long-running server via stats to
>> > know what to set the value to.
>> >
>> > If that helps, we might still have a bug in hash expansion. I see
>> > someone finally reproduced a possible issue there under .20. .17/.19
>> > fix other causes of the problem pretty thoroughly, though.
>> >
>> > On Tue, 1 Jul 2014, Denis Samoylov wrote:
>> >
>> > > Hi,
>> > > We had sporadic memory corruption due to tail repair in pre-.20
>> > > versions, so we updated some of our servers to .20. This Monday we
>> > > observed several crashes in the .15 version and tons of "allocation
>> > > failure" errors in the .20 version. This is expected, as .20 just
>> > > disables "tail repair", but it seems the problem is still there.
>> > > What is interesting:
>> > > 1) there is no visible change in traffic, and usually only one slab
>> > > is affected.
>> > > 2) this always happens with several, but not all, servers :)
>> > >
>> > > Is there any way to catch this and help with debugging? I have all
>> > > the slab and item stats for the time around the incident for the
>> > > .15 and .20 versions. .15 is clearly memory corruption: gdb shows
>> > > that the hash function returned 0 (line 115: uint32_t hv =
>> > > hash(ITEM_key(search), search->nkey, 0);).
>> > >
>> > > So it seems we are hitting this comment:
>> > > /* Old rare bug could cause a refcount leak. We haven't seen
>> > >  * it in years, but we leave this code in to prevent failures
>> > >  * just in case */
>> > >
>> > > :)
>> > >
>> > > Thank you,
>> > > Denis

--

---
You received this message because you are subscribed to the Google Groups
"memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
For more options, visit https://groups.google.com/d/optout.
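[Editor's note: for anyone following along, the presizing and stats checks discussed in the thread look roughly like this. The `-o hashpower` option, the `hash_power_level` stat, and the per-slab `outofmemory` counter are the ones named above; the host, port, memory flag, and the use of `nc` (whose flags vary between netcat variants) are illustrative assumptions:]

```shell
# Start memcached with the hash table presized to 2**26 buckets, so hash
# expansion never has to run at runtime (-d/-m/-p values are illustrative):
memcached -d -m 16384 -p 11211 -o hashpower=26

# Read the current hash power from a long-running server:
echo "stats" | nc -q 1 localhost 11211 | grep hash_power_level

# Check for per-slab write failures: the "outofmemory" counter in
# "stats items" (e.g. items:13:outofmemory for slab 13):
echo "stats items" | nc -q 1 localhost 11211 | grep outofmemory
```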
