Updates:
Status: Accepted
Comment #15 on issue 363 by [email protected]: MemcachePool::get(): Server
127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
http://code.google.com/p/memcached/issues/detail?id=363
So it turned out that c->state was set to "conn_closed".
Looking at this a bit more closely:
conn_closed is only ever set from conn_close().
conn_close() is only called from two spots: once, before drive_machine() is
entered (and early returned from), and once from within drive_machine(),
directly above the conn_closed case and with a stop/break before it:
case conn_closing:
if (IS_UDP(c->transport))
conn_cleanup(c);
else
conn_close(c);
stop = true;
break;
case conn_closed:
... conn_close() always deletes the event from the stack, closes the
filehandle, etc.
So, I don't see how this could happen... yet...
None of the code changed between 1.4.17 or 1.4.18 seems to add new paths
which could cause a connection to re-fire.
If conn_close() is called there's no way for that to loop again (stop =
true).
What would have to happen is the event_handler firing again, on the closed
connection, which the fd for is closed, the event is deleted, but the
memory not reused just yet... It would then enter drive_machine with the
state already set to conn_closing, and never trip a stop = true, and not
assert since it's not a debug binary.
Which is fucking terrifying. If this happened in the old code it'd just
keep running into conn_closing and re-closing itself. though I was pretty
sure that calling event_del() twice causes a crash.
The other possibility is that a UDP socket is getting closed... except that
also deletes the event, and closes the socket, so no new events should
happen.
I did a quick test and added a second conn_close() call and.... it doesn't
cause a crash. it causes the curr_connections counter to slowly drift. That
is wild.
I just pushed:
https://github.com/memcached/memcached/commit/ee961e456457728ba78057961eca357edaea1ec1
...up to master.
I'm still a bit suspicious that I'm missing something important here... so
I'm not doing a release tonight, but I might do one early tomorrow anyway.
Reporter is currently running a version of this patch in production; if his
thing hangs up again, or doesn't self-recover once hitting the condition,
we'll have a better idea I guess.
--
You received this message because this project is configured to send all
issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings
--
---
You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.