Susan Hinrichs created TS-3871:
----------------------------------
Summary: VC Migration Can Lose Events
Key: TS-3871
URL: https://issues.apache.org/jira/browse/TS-3871
Project: Traffic Server
Issue Type: Bug
Components: HTTP
Reporter: Susan Hinrichs
Found this in my stress testing. Sometimes the POST or GET response is
completely empty. No header and no body. The packet capture shows that ATS
closes the connection 70 seconds after the last POST or GET of the connection
was received. This matches the value of
proxy.config.http.keep_alive_no_activity_timeout_in on my test box.
I moved from global pool to local pool and the problem went away.
I eventually tracked it down to a problem in the epoll update. During the
migration, ep.start() would sometimes fail with an EEXIST error, which means
the file descriptor is already associated with that epoll instance. If we are
migrating from thread A to thread B, this should not be the case, unless we
went from thread B to thread A and back to thread B without cleaning up the
original thread B epoll registration. If that is happening, then multiple
threads will be processing network events for the same descriptor, which seems
like a recipe for disaster and dropped events.
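
The EEXIST condition can be reproduced outside ATS. The following is a minimal
sketch using the raw Linux epoll calls, not the ATS ep.start() wrapper; the
function name errno_of_duplicate_add is hypothetical and a pipe stands in for
the client socket.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper (not ATS code): registers a descriptor with an epoll
// instance twice and returns the errno of the second EPOLL_CTL_ADD.
// Returns 0 if the second add unexpectedly succeeds, -1 on setup failure.
int errno_of_duplicate_add()
{
  int ep = epoll_create1(0);
  if (ep < 0) {
    return -1;
  }

  int fds[2];
  if (pipe(fds) != 0) {
    close(ep);
    return -1;
  }

  epoll_event ev{};
  ev.events  = EPOLLIN;
  ev.data.fd = fds[0];

  int result = -1;
  // First registration: the descriptor is new to this epoll instance.
  if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) == 0) {
    // Second registration of the same descriptor without an intervening
    // EPOLL_CTL_DEL: the kernel rejects it.
    result = (epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) == 0) ? 0 : errno;
  }

  close(fds[0]);
  close(fds[1]);
  close(ep);
  return result;
}
```

On Linux the second add is expected to fail with EEXIST, which is the same
error described above for a descriptor that was never removed from the
original thread's epoll.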
Originally, I left the ep.stop(), which removes the descriptor from the
original thread's epoll structure, to be done by the original thread. But under
stress that seems to be a bad idea. Too much drift. With some more research, it
appears that the epoll calls are thread safe.
http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-03/msg00084.html
I rearranged the code to do both the ep.stop() and ep.start() in the same
migrating target thread, and my stress test had no more problems.
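
The difference between the two orderings can be sketched with two epoll
instances standing in for threads A and B. This is a single-threaded
illustration under stated assumptions, not the actual patch: migrate and
round_trip are hypothetical names, raw epoll_ctl calls replace ep.stop() and
ep.start(), and skipping the removal models a stop that the original thread
has not yet performed.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper: move fd from one epoll instance to another.
// With stop_first == false the removal from the source is skipped, modeling
// cleanup that is deferred to the original thread and has not run yet.
// Returns 0 on success or the errno of the failing epoll_ctl call.
int migrate(int fd, int from_ep, int to_ep, bool stop_first)
{
  if (stop_first && epoll_ctl(from_ep, EPOLL_CTL_DEL, fd, nullptr) != 0) {
    return errno;
  }
  epoll_event ev{};
  ev.events  = EPOLLIN;
  ev.data.fd = fd;
  if (epoll_ctl(to_ep, EPOLL_CTL_ADD, fd, &ev) != 0) {
    return errno;
  }
  return 0;
}

// Start on B, migrate B -> A, then A -> B.
// Returns the result of the final migration back to B (-1 on setup failure).
int round_trip(bool stop_first)
{
  int a = epoll_create1(0);
  int b = epoll_create1(0);
  int fds[2];
  int result = -1;

  if (a >= 0 && b >= 0 && pipe(fds) == 0) {
    epoll_event ev{};
    ev.events  = EPOLLIN;
    ev.data.fd = fds[0];
    if (epoll_ctl(b, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
        migrate(fds[0], b, a, stop_first) == 0) {
      result = migrate(fds[0], a, b, stop_first);
    }
    close(fds[0]);
    close(fds[1]);
  }
  if (a >= 0) {
    close(a);
  }
  if (b >= 0) {
    close(b);
  }
  return result;
}
```

In this sketch, round_trip(true) completes cleanly because each migration
removes the descriptor from the source before adding it to the target, while
round_trip(false) fails on the return to B with EEXIST because B's stale
registration is still in place.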
I've run this patch on a production machine for over 12 hours with no crashes
and no performance discrepancies. We will be expanding this testing.
To repeat, this is not a problem we saw in production, but only in my "make it
fall over" stress test.