Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

notifications Thu, 08 May 2014 14:16:31 -0700

I am just speculating, and by no means have any idea what I am really 
talking about here. :)


With 2 threads it is still solid: no timeouts, no runaway 100% CPU. It's been 
days. Increasing from 2 threads to 4 does not generate any more traffic or 
requests to memcached. Thus I am speculating that perhaps it is a race condition 
of some sort, only hitting with > 2 threads.
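
One way to confirm that the thread count is the only thing changing between
runs is to compare the `stats` output before and after. A minimal sketch,
assuming the standard text-protocol reply (`STAT <name> <value>` lines
terminated by `END`). The helper names `parse_stats` and `fetch_stats` are made
up for illustration and are not part of any memcached client:

```python
import socket


def parse_stats(raw):
    """Turn the text reply of memcached's `stats` command into a dict.

    Every value is kept as a string; callers convert what they need.
    """
    stats = {}
    for line in raw.splitlines():
        line = line.strip()
        if line == "END":
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host="127.0.0.1", port=11211, timeout=2.0):
    """Send `stats` over TCP and parse the reply (hypothetical helper)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        # Read until the terminating END line or until the server hangs up.
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return parse_stats(buf.decode("ascii", "replace"))
```

Calling `fetch_stats()` against the local instance and comparing `version`,
`libevent`, `threads` and `curr_connections` between the 2-thread and 4-thread
runs would at least rule out a configuration difference.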

Why do you say it will be less likely to happen with 2 threads than 4?
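
For reference, the double-close check discussed in the quoted messages can be
pictured with a toy model. This is only a sketch in Python, not memcached's C
code: `Conn`, `Stats` and this `conn_close` are illustrative stand-ins, and the
`closed` attribute is modelled on the conn_closed flag mentioned in the quote.

```python
class Stats:
    """Toy counter standing in for the curr_conns stat."""

    def __init__(self):
        self.curr_conns = 0


class Conn:
    """Toy connection slot; opening one bumps the counter."""

    def __init__(self, fd, stats):
        self.fd = fd
        self.closed = False  # hypothetical flag, modelled on conn_closed
        stats.curr_conns += 1


def conn_close(conn, stats, log):
    """Close a connection once; record and ignore any repeated close."""
    if conn.closed:
        log.append("ERROR: Double Close [%d]" % conn.fd)
        return False
    conn.closed = True
    stats.curr_conns -= 1
    return True
```

Without the `closed` check, the second call would decrement the counter again,
which is how a repeated close would eventually show up as a negative
curr_conns.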

On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
>
> That doesn't really tell us anything about the nature of the problem 
> though. With 2 threads it might still happen, but is a lot less likely. 
>
> On Wed, 7 May 2014, [email protected] <javascript:> wrote: 
>
> > Bumped up to 2 threads and so far no timeout errors. I'm going to let it 
> > run for a few more days, then revert back to 4 threads and see if timeout 
> > errors come up again. That will tell us whether the problem lies in spawning 
> > more than 2 threads. 
> > 
> > On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote: 
> >       Hey, 
> > 
> >       try this branch: 
> >       https://github.com/dormando/memcached/tree/double_close 
> > 
> >       so far as I can tell that emulates the behavior in .17... 
> > 
> >       to build: 
> >       ./autogen.sh && ./configure && make 
> > 
> >       run it in screen like you were doing with the other tests, see if it 
> >       prints "ERROR: Double Close [somefd]". If it prints that once then stops, 
> >       I guess that's what .17 was doing... if it spams that message, then 
> >       something else may have changed. 
> > 
> >       I'm mostly convinced something about your OS or build is corrupt, but I 
> >       have no idea what it is. The only other thing I can think of is to 
> >       instrument .17 a bit more and have you try that (with the connection code 
> >       laid out the old way, but with a conn_closed flag to detect a double close 
> >       attempt), and see if the old .17 still did it. 
> > 
> >       On Tue, 6 May 2014, [email protected] wrote: 
> > 
> >       > Changing from 4 threads to 1 seems to have resolved the problem. No timeouts 
> >       > since. Should I set to 2 threads and wait and see how things go? 
> >       > 
> >       > On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote: 
> >       >       and how'd that work out? 
> >       > 
> >       >       Still no other reports :/ a few thousand more downloads of .19... 
> >       > 
> >       >       On Sun, 4 May 2014, [email protected] wrote: 
> >       > 
> >       >       > I'm going to try switching threads from 4 to 1. This host, web2, 
> >       >       > is the only one I am seeing it on, but it is also the only host 
> >       >       > that gets any real traffic. Super frustrating. 
> >       >       > 
> >       >       > On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote: 
> >       >       >       I'm stumped. (also, your e-mails aren't updating the ticket...). 
> >       >       > 
> >       >       >       It's impossible for a connection to get into the closed state without 
> >       >       >       having event_del() and close() called on the socket. A socket slot isn't 
> >       >       >       event_add()'ed again until after the state is reset to 'init_state'. 
> >       >       > 
> >       >       >       There was no code path for event_del to actually fail so far as I could 
> >       >       >       see. 
> >       >       > 
> >       >       >       I've e-mailed steven grimm for ideas but either that's not his e-mail 
> >       >       >       anymore or he's not going to respond. 
> >       >       > 
> >       >       >       I really don't know. I guess the old code would've just called conn_close 
> >       >       >       again by accident... I don't see how the logic changed in any significant 
> >       >       >       way in .18. Though again, if it happened with any frequency people's 
> >       >       >       curr_conns stat would go negative. 
> >       >       > 
> >       >       >       So... either that always happened and we never noticed, or your particular 
> >       >       >       OS is corrupt. There're probably 10,000+ installs of .18+ now and only one 
> >       >       >       complaint, so I'm a little hesitant to spend a ton of time on this until 
> >       >       >       we get more reports. 
> >       >       > 
> >       >       >       You should downgrade to .17. 
> >       >       > 
> >       >       >       On Sun, 4 May 2014, [email protected] wrote: 
> >       >       > 
> >       >       >       > Damn it, got network timeout. CPU 3 is using 100% cpu from memcached. 
> >       >       >       > Here is the result of stat to verify using new version of memcached and libevent: 
> >       >       >       > 
> >       >       >       > STAT version 1.4.19 
> >       >       >       > STAT libevent 2.0.18-stable 
> >       >       >       > 
> >       >       >       > 
> >       >       >       > On Saturday, May 3, 2014 11:55:31 PM UTC-7, [email protected] wrote: 
> >       >       >       >       Just upgraded all 5 web-servers to memcached 1.4.19 with libevent 2.0.18. 
> >       >       >       >       Will advise if I see memcached timeouts. Should be good though. 
> >       >       >       > 
> >       >       >       > Thanks so much for all the help and patience. 
> Really appreciated. 
> >       >       >       > 
> >       >       >       > On Friday, May 2, 2014 10:20:26 PM UTC-7, [email protected] wrote: 
> >       >       >       >       Updates: 
> >       >       >       >               Status: Invalid 
> >       >       >       > 
> >       >       >       >       Comment #20 on issue 363 by [email protected]: MemcachePool::get(): Server 
> >       >       >       >       127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout 
> >       >       >       >       http://code.google.com/p/memcached/issues/detail?id=363 
> >       >       >       > 
> >       >       >       >       Any repeat crashes? I'm going to close this. It looks like remi 
> >       >       >       >       shipped .19. Reopen or open a new one if it hangs in the same way somehow... 
> >       >       >       > 
> >       >       >       >       Well, .19 won't be printing anything, and it won't hang, but if it's 
> >       >       >       >       actually our bug and not libevent it would end up spinning CPU. Keep an eye 
> >       >       >       >       out I guess. 
> >       >       >       > 
> >       >       >       >       -- 
> >       >       >       >       You received this message because this 
> project is configured to send all   
> >       >       >       >       issue notifications to this address. 
> >       >       >       >       You may adjust your notification 
> preferences at: 
> >       >       >       >       https://code.google.com/hosting/settings 
> >       >       >       > 
> >       >       >       > -- 
> >       >       >       > 
> >       >       >       > --- 
> >       >       >       > You received this message because you are 
> subscribed to the Google Groups "memcached" group. 
> >       >       >       > To unsubscribe from this group and stop 
> receiving emails from it, send an email to [email protected]. 
>
> >       >       >       > For more options, visit 
> https://groups.google.com/d/optout. 
> >       >       >       > 
> >       >       >       > 
> >       >       > 
> >       > 
> > 

