It looks like just replacing the do_item_unlink_nolock to do_item_unlink can fix it. I am going to create a new issue and try to pull a patch.
-----Original Message----- From: [email protected] [mailto:[email protected]] On Behalf Of dormando Sent: Friday, July 04, 2014 3:10 AM To: [email protected] Subject: Re: slab re-balance seems not thread-safty Seems like you're right.. I'd re-arranged where the LRU lock (cache_lock) is called then forgot to update that one bit. Most of the do_item_unlink code is safe there, until it gets into the LRU bits. It's unlikely anyone actually saw a crash from this as it's a narrow race though. That's easy to fix. It's still necessary to delete it, since threads can stack around a handful of objects and cause rebalance to hang. Thanks! On Thu, 3 Jul 2014, Zhiwei Chan wrote: > the item lock can only protect the hash list, but what about the LRU > list? As far as i know, if trying to delete a node from a doubly-linked-list, > it is necessary to lock at least 3 node: node, node->pre, node->next. I will > try to check if it may crash the LRU list in gdb next week . > And I think in do_item_get it is not necessary to delete the item that is > re-balanced, just leave it there and return NULL seems better. > > 在 2014年7月3日星期四UTC+8下午1时30分29秒,Dormando写道: > the item lock is already held for that key when do_item_get is called, > which is why the nolock code is called there. > > slab rebalance has that second short-circuiting of fetches to ensure > very > hot items don't permanently jam a page move. > > On Wed, 2 Jul 2014, Zhiwei Chan wrote: > > > Hi all, I have thought carefully about the the thread-safe > memcached recently, and found that if the re-balance is running, it may > not > > thread-safety. The code "do_item_get->do_item_unlink_nolock" may > corrupt the hash table. Whenever it trying to modify the hash table, > it should get > > cache_lock, but the function do_item_get have not got the cache_lock. > > Please tell me if anything i neglected. > > > > /** wrapper around assoc_find which does the lazy expiration logic */ > > item *do_item_get(const char *key, const size_t nkey, const uint32_t > hv) { > > //mutex_lock(&cache_lock); > > item *it = assoc_find(key, nkey, hv); > > if (it != NULL) { > > refcount_incr(&it->refcount); > > /* Optimization for slab reassignment. prevents popular items > from > > * jamming in busy wait. Can only do this here to satisfy > lock order > > * of item_lock, cache_lock, slabs_lock. */ > > if (slab_rebalance_signal && > > ((void *)it >= slab_rebal.slab_start && (void *)it < > slab_rebal.slab_end)) { > > do_item_unlink_nolock(it, hv); > -------------------------------------------------------------------> no lock > before > unlink. > > do_item_remove(it); > > it = NULL; > > } > > } > > > > -- > > > > --- > > You received this message because you are subscribed to the Google > Groups "memcached" group. > > To unsubscribe from this group and stop receiving emails from it, > send an email to [email protected]. > > For more options, visit https://groups.google.com/d/optout. > > > > > > -- > > --- > You received this message because you are subscribed to the Google Groups > "memcached" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > For more options, visit https://groups.google.com/d/optout. > > -- --- You received this message because you are subscribed to the Google Groups "memcached" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups "memcached" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
