Hi,
   With hash power 26 and slab 13, that works out to roughly
(2**26) * 1.5 * 1488 bytes ≈ 140 GiB of memory needed. Could you please
post the stats info to this thread, or send me a copy as well? Also, are
the "tons of 'allocation failure'" messages from the system log, or from
the "outofmemory" statistic in memcached? Lastly, I think generating a
core file with debug info would be very helpful.
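
For reference, here is the rough arithmetic and how I would check those
counters; the host/port and core path below are just placeholders, and the
build step assumes you compile memcached from source:

    # 2^26 buckets * 1.5 items/bucket * 1488 bytes/item (slab 13 chunk size)
    #   = 149,786,984,448 bytes, i.e. roughly 140 GiB

    # per-slab OOM counter vs. messages in the system log:
    echo "stats items" | nc 127.0.0.1 11211 | grep outofmemory
    echo "stats" | nc 127.0.0.1 11211 | grep -E "hash_power_level|hash_is_expanding"

    # core file with debug info:
    ./configure CFLAGS="-g -O2" && make
    ulimit -c unlimited              # allow core dumps before starting memcached
    gdb ./memcached /path/to/core    # inspect the dump after a failure
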
  Thanks.


2014-07-03 6:43 GMT+08:00 Denis Samoylov <[email protected]>:

> >1)  OOM's on slab 13, but it recovered on its own? This is under version
> 1.4.20 and you did *not* enable tail repairs?
> correct
>
>
> >2) Can you share (with me at least) the full stats/stats items/stats
> slabs output from one of the affected servers running 1.4.20?
> I sent you the _current_ stats from the server that had the OOM a couple
> of days ago and is still running (now with no issues).
>
>
> >3) Can you confirm that 1.4.20 isn't *crashing*, but is
> actually exhibiting write failures?
> correct
>
> We will enable saving stderr to a log; maybe that will show something. If
> you have any other ideas, let me know.
>
> -denis
>
>
>
> On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
>>
>> Cool. That is disappointing.
>>
>> Can you clarify a few things for me:
>>
>> 1) You're saying that you were getting OOM's on slab 13, but it recovered
>> on its own? This is under version 1.4.20 and you did *not* enable tail
>> repairs?
>>
>> 2) Can you share (with me at least) the full stats/stats items/stats
>> slabs
>> output from one of the affected servers running 1.4.20?
>>
>> 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
>> exhibiting write failures?
>>
>> If it's not a crash, and your hash power level isn't expanding, I don't
>> think it's related to the other bug.
>>
>> thanks!
>>
>> On Wed, 2 Jul 2014, Denis Samoylov wrote:
>>
>> > Dormando, sure, we will add an option to presize the hashtable (as I
>> > see, nn should be 26).
>> > One question: as I see in the logs for these servers, there was no
>> > change in hash_power_level before the incident (it would be hard to say
>> > for the crashed ones, but .20 just had outofmemory and I have solid
>> > stats). Doesn't that contradict this idea of the cause? The server had
>> > hash_power_level = 26 for days before and still has 26 days after. Just
>> > for three hours, every set for slab 13 failed. We did not reboot/flush
>> > the server, and it continues to work without problems. What do you think?
>> >
>> > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
>> >       Hey,
>> >
>> >       Can you presize the hash table? (-o hashpower=nn) to be large
>> enough on
>> >       those servers such that hash expansion won't happen at runtime?
>> You can
>> >       see what hashpower is on a long running server via stats to know
>> what to
>> >       set the value to.
>> >
>> >       If that helps, we might still have a bug in hash expansion. I see
>> someone
>> >       finally reproduced a possible issue there under .20. .17/.19 fix
>> other
>> >       causes of the problem pretty thoroughly though.
>> >
>> >       On Tue, 1 Jul 2014, Denis Samoylov wrote:
>> >
>> >       > Hi,
>> >       > We had sporadic memory corruption due to tail repair in pre-.20
>> >       > versions, so we updated some of our servers to .20. This Monday
>> >       > we observed several crashes on the .15 version and tons of
>> >       > "allocation failure" on the .20 version. This is expected, as
>> >       > .20 just disables "tail repair", but it seems the problem is
>> >       > still there. What is interesting:
>> >       > 1) there is no visible change in traffic, and usually only one
>> >       > slab is affected.
>> >       > 2) this always happens with several but not all servers :)
>> >       >
>> >       > Is there any way to catch this and help with debugging? I have
>> >       > all the slab and item stats for the time around the incident for
>> >       > the .15 and .20 versions. The .15 one is clearly memory
>> >       > corruption: gdb shows that the hash function returned 0 (line
>> >       > 115: uint32_t hv = hash(ITEM_key(search), search->nkey, 0);).
>> >       >
>> >       > So it seems we are hitting this comment:
>> >       >             /* Old rare bug could cause a refcount leak. We haven't seen
>> >       >              * it in years, but we leave this code in to prevent failures
>> >       >              * just in case */
>> >       >
>> >       > :)
>> >       >
>> >       > Thank you,
>> >       > Denis
>> >       >
>> >
>

