Updates:
        Status: Accepted

Comment #15 on issue 363 by [email protected]: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
http://code.google.com/p/memcached/issues/detail?id=363

So it turned out that c->state was set to "conn_closed".

Looking at this a bit more closely:

conn_closed is only ever set from conn_close().

conn_close() is only called from two spots: once, before drive_machine() is entered (and early returned from), and once from within drive_machine(), directly above the conn_closed case and with a stop/break before it:

        case conn_closing:
            if (IS_UDP(c->transport))
                conn_cleanup(c);
            else
                conn_close(c);
            stop = true;
            break;

        case conn_closed:

... conn_close() always deletes the event from the stack, closes the filehandle, etc.

So, I don't see how this could happen... yet...

None of the code changed between 1.4.17 or 1.4.18 seems to add new paths which could cause a connection to re-fire.

If conn_close() is called there's no way for that to loop again (stop = true).

What would have to happen is the event_handler firing again, on the closed connection, which the fd for is closed, the event is deleted, but the memory not reused just yet... It would then enter drive_machine with the state already set to conn_closing, and never trip a stop = true, and not assert since it's not a debug binary.

Which is fucking terrifying. If this happened in the old code it'd just keep running into conn_closing and re-closing itself. though I was pretty sure that calling event_del() twice causes a crash.

The other possibility is that a UDP socket is getting closed... except that also deletes the event, and closes the socket, so no new events should happen.

I did a quick test and added a second conn_close() call and.... it doesn't cause a crash. it causes the curr_connections counter to slowly drift. That is wild.

I just pushed: https://github.com/memcached/memcached/commit/ee961e456457728ba78057961eca357edaea1ec1

...up to master.

I'm still a bit suspicious that I'm missing something important here... so I'm not doing a release tonight, but I might do one early tomorrow anyway.

Reporter is currently running a version of this patch in production; if his thing hangs up again, or doesn't self-recover once hitting the condition, we'll have a better idea I guess.

--
You received this message because this project is configured to send all issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings

--

--- You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to