On that note, it *is* useful if you try that branch I posted, since as far
as I can tell it should emulate the .17 behavior.
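
In case it helps to see what that branch is checking for, the guard is
roughly this kind of thing (a simplified sketch with illustrative names,
not the actual diff -- the real code differs in how it handles the repeat):

    /* Sketch only -- simplified names, not the real memcached source.
     * Mark a slot once it has been torn down; if a second close attempt
     * comes in for the same slot, log the fd so we can see it happen. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <unistd.h>
    #include <event.h>

    typedef struct {
        int          sfd;     /* socket fd for this connection        */
        bool         closed;  /* set once teardown has run            */
        struct event event;   /* libevent registration for this fd    */
    } conn;

    static void conn_close(conn *c) {
        if (c->closed) {
            /* Second close attempt on the same slot: don't touch the
             * fd (it may already belong to someone else), just log. */
            fprintf(stderr, "ERROR: Double Close [%d]\n", c->sfd);
            return;
        }
        event_del(&c->event);  /* stop libevent callbacks for this fd */
        close(c->sfd);         /* release the socket                  */
        c->closed = true;      /* remember so a repeat call is caught */
    }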

On Thu, 8 May 2014, dormando wrote:

> > I am just speculating, and by no means have any idea what I am really
> > talking about here. :)
> > With 2 threads, still solid: no timeouts, no runaway 100% CPU. It's been
> > days. Increasing from 2 threads to 4 does not generate any more traffic
> > or requests to memcached, so I am speculating that perhaps it is a race
> > condition of some sort that only hits with more than 2 threads.
>
> Doesn't tell me anything useful, since I'm already looking for potential
> races and don't see any possibility outside of libevent.
>
> > Why do you say it will be less likely to happen with 2 threads than 4?
>
> Nature of race conditions: the more threads you have running, the more
> likely you are to hit them, sometimes by orders of magnitude.
>
> It doesn't really change the fact that this has worked for many years and
> the code *barely* changed recently. I just don't see it.
>
> > On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
> >       That doesn't really tell us anything about the nature of the problem
> >       though. With 2 threads it might still happen, but is a lot less
> >       likely.
> >
> >       On Wed, 7 May 2014, [email protected] wrote:
> >
> >       > Bumped up to 2 threads and so far no timeout errors. I'm going to
> >       > let it run for a few more days, then revert back to 4 threads and
> >       > see if timeout errors come up again. That will tell us whether the
> >       > problem lies in spawning more than 2 threads.
> >       >
> >       > On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
> >       >       Hey,
> >       >
> >       >       try this branch:
> >       >       https://github.com/dormando/memcached/tree/double_close
> >       >
> >       >       so far as I can tell that emulates the behavior in .17...
> >       >
> >       >       to build:
> >       >       ./autogen.sh && ./configure && make
> >       >
> >       >       run it in screen like you were doing with the other tests, and
> >       >       see if it prints "ERROR: Double Close [somefd]". If it prints
> >       >       that once and then stops, I guess that's what .17 was doing...
> >       >       if it print-spams, then something else may have changed.
> >       >
> >       >       I'm mostly convinced something about your OS or build is
> >       >       corrupt, but I have no idea what it is. The only other thing I
> >       >       can think of is to instrument .17 a bit more and have you try
> >       >       that (with the connection code laid out the old way, but with a
> >       >       conn_closed flag to detect a double-close attempt), and see if
> >       >       the old .17 still did it.
> >       >
> >       >       On Tue, 6 May 2014, [email protected] wrote:
> >       >
> >       >       > Changing from 4 threads to 1 seems to have resolved the
> >       >       > problem. No timeouts since. Should I set it to 2 threads and
> >       >       > wait and see how things go?
> >       >       >
> >       >       > On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
> >       >       >       and how'd that work out?
> >       >       >
> >       >       >       Still no other reports :/ a few thousand more downloads
> >       >       >       of .19...
> >       >       >
> >       >       >       On Sun, 4 May 2014, [email protected] wrote:
> >       >       >
> >       >       >       > I'm going to try switching threads from 4 to 1. This
> >       >       >       > host, web2, is the only one I am seeing it on, but it
> >       >       >       > is also the only host that gets any real traffic.
> >       >       >       > Super frustrating.
> >       >       >       >
> >       >       >       > On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote:
> >       >       >       >       I'm stumped. (also, your e-mails aren't updating
> >       >       >       >       the ticket...).
> >       >       >       >
> >       >       >       >       It's impossible for a connection to get into the
> >       >       >       >       closed state without having event_del() and
> >       >       >       >       close() called on the socket. A socket slot isn't
> >       >       >       >       event_add()'ed again until after the state is
> >       >       >       >       reset to 'init_state'.
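
To spell out the ordering I'm describing there, as a sketch -- simplified
names, not the real source: teardown is the only place event_del()/close()
happen, and a slot only gets event_add()'ed again after its state has been
reset.

    #include <unistd.h>
    #include <event.h>

    enum conn_states { conn_init, conn_parked };

    typedef struct {
        int              sfd;
        enum conn_states state;
        struct event     event;
    } conn;

    /* Teardown: unregister from libevent, release the fd, park the slot. */
    static void conn_teardown(conn *c) {
        event_del(&c->event);
        close(c->sfd);
        c->state = conn_parked;
    }

    /* Reuse: the parked slot gets its state reset first, and only then is
     * a new event registered and added for the new fd. */
    static void conn_reuse(conn *c, int new_sfd, struct event_base *base,
                           void (*handler)(int, short, void *)) {
        c->sfd   = new_sfd;
        c->state = conn_init;
        event_set(&c->event, new_sfd, EV_READ | EV_PERSIST, handler, c);
        event_base_set(base, &c->event);
        event_add(&c->event, NULL);
    }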
> >       >       >       >
> >       >       >       >       There was no code path for event_del to actually
> >       >       >       >       fail, so far as I could see.
> >       >       >       >
> >       >       >       >       I've e-mailed Steven Grimm for ideas, but either
> >       >       >       >       that's not his e-mail anymore or he's not going
> >       >       >       >       to respond.
> >       >       >       >
> >       >       >       >       I really don't know. I guess the old code would've
> >       >       >       >       just called conn_close again by accident... I don't
> >       >       >       >       see how the logic changed in any significant way in
> >       >       >       >       .18. Though again, if it happened with any
> >       >       >       >       frequency, people's curr_conns stat would go
> >       >       >       >       negative.
> >       >       >       >
> >       >       >       >       So... either that always happened and we never
> >       >       >       >       noticed, or your particular OS is corrupt. There
> >       >       >       >       are probably 10,000+ installs of .18+ now and only
> >       >       >       >       one complaint, so I'm a little hesitant to spend a
> >       >       >       >       ton of time on this until we get more reports.
> >       >       >       >
> >       >       >       >       You should downgrade to .17.
> >       >       >       >
> >       >       >       >       On Sun, 4 May 2014, [email protected] wrote:
> >       >       >       >
> >       >       >       >       > Damn it, got a network timeout. CPU 3 is at
> >       >       >       >       > 100% from memcached. Here is the result of
> >       >       >       >       > "stats" to verify the new versions of memcached
> >       >       >       >       > and libevent:
> >       >       >       >       >
> >       >       >       >       > STAT version 1.4.19
> >       >       >       >       > STAT libevent 2.0.18-stable
> >       >       >       >       >
> >       >       >       >       >
> >       >       >       >       > On Saturday, May 3, 2014 11:55:31 PM UTC-7, [email protected] wrote:
> >       >       >       >       >       Just upgraded all 5 web-servers to
> >       >       >       >       >       memcached 1.4.19 with libevent 2.0.18.
> >       >       >       >       >       Will advise if I see memcached timeouts.
> >       >       >       >       >       Should be good though.
> >       >       >       >       >
> >       >       >       >       > Thanks so much for all the help and patience.
> >       >       >       >       > Really appreciated.
> >       >       >       >       >
> >       >       >       >       > On Friday, May 2, 2014 10:20:26 PM UTC-7, [email protected] wrote:
> >       >       >       >       >       Updates:
> >       >       >       >       >               Status: Invalid
> >       >       >       >       >
> >       >       >       >       >       Comment #20 on issue 363 by [email protected]:
> >       >       >       >       >       MemcachePool::get(): Server 127.0.0.1
> >       >       >       >       >       (tcp 11211, udp 0) failed with: Network timeout
> >       >       >       >       >       http://code.google.com/p/memcached/issues/detail?id=363
> >       >       >       >       >
> >       >       >       >       >       Any repeat crashes? I'm going to close
> >       >       >       >       >       this. It looks like Remi shipped .19.
> >       >       >       >       >       Reopen or open a new one if it hangs in
> >       >       >       >       >       the same way somehow...
> >       >       >       >       >
> >       >       >       >       >       Well, .19 won't be printing anything, and
> >       >       >       >       >       it won't hang, but if it's actually our
> >       >       >       >       >       bug and not libevent it would end up
> >       >       >       >       >       spinning CPU. Keep an eye out, I guess.