[ https://issues.apache.org/jira/browse/TS-3871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Susan Hinrichs updated TS-3871:
-------------------------------
Attachment: ts-3871.diff
> VC Migration Can Lose Events
> ----------------------------
>
> Key: TS-3871
> URL: https://issues.apache.org/jira/browse/TS-3871
> Project: Traffic Server
> Issue Type: Bug
> Components: HTTP
> Reporter: Susan Hinrichs
> Assignee: Susan Hinrichs
> Attachments: ts-3871.diff
>
>
> Found this in my stress testing. Sometimes the POST or GET response is
> completely empty: no header and no body. The packet capture shows that ATS
> closes the connection 70 seconds after the last POST or GET on the connection
> was received. This matches the
> proxy.config.http.keep_alive_no_activity_timeout_in setting on my test box.
> I moved from the global pool to the local pool and the problem went away.
> I eventually tracked it down to a problem in the epoll update. ep.start()
> during the migration would sometimes fail with an EEXIST error. This means that
> the file descriptor is already associated with the epoll. If we are
> migrating from thread A to thread B, this should not be the case, unless we
> went from thread B to thread A and back to thread B without cleaning up the
> original thread B epoll. If this is happening, then multiple threads will be
> processing network events for the same descriptor, which seems like a recipe
> for disaster and dropped events.
> Originally, I left the ep.stop(), which removes the descriptor from the
> original thread's epoll structure, to be done by the original thread. But
> under stress that seems to be a bad idea. Too much drift. With some more
> research, it appears that the epoll calls are thread safe.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-03/msg00084.html
> I rearranged the code to do both the ep.stop() and ep.start() in the same
> migrating target thread, and my stress test had no more problems.
> I've run this patch on a production machine for over 12 hours with no crashes
> and no performance discrepancies. We will be expanding this testing.
> To repeat, this is not a problem we saw in production, but only in my "make
> it fall over" stress test.
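
The ordering problem described above can be reproduced outside Traffic Server with two plain epoll instances. What follows is a minimal, single-threaded Python sketch (Linux only), not the ATS code and not the contents of ts-3871.diff. The helpers start(), stop(), migrate_deferred(), migrate_together() and has_event() are hypothetical names chosen to mirror ep.start()/ep.stop(), the two epoll objects stand in for the pollers of thread A and thread B, and a pipe stands in for the client socket.

```python
import errno
import os
import select


def start(ep, fd):
    """Register fd for read events; return 0, or the errno of the failure."""
    try:
        ep.register(fd, select.EPOLLIN)
        return 0
    except OSError as exc:
        return exc.errno


def stop(ep, fd):
    """Remove fd from the epoll set."""
    ep.unregister(fd)


def migrate_deferred(src, dst, fd, pending):
    """Old ordering: start on the target now, leave the stop for later."""
    pending.append((src, fd))
    return start(dst, fd)


def migrate_together(src, dst, fd):
    """New ordering: stop and start back to back in the migrating thread."""
    stop(src, fd)
    return start(dst, fd)


def has_event(ep, fd):
    """Non-blocking check: does this epoll currently report fd as ready?"""
    return any(ready_fd == fd for ready_fd, _ in ep.poll(0))


# Stand-ins (assumptions): one epoll per "thread", a pipe as the connection.
epoll_a, epoll_b = select.epoll(), select.epoll()
read_fd, write_fd = os.pipe()

# Deferred cleanup: B -> A -> B before thread B has run its stop.
pending = []
start(epoll_b, read_fd)
deferred_first = migrate_deferred(epoll_b, epoll_a, read_fd, pending)
os.write(write_fd, b"x")  # one unread byte keeps the fd readable
seen_by_a = has_event(epoll_a, read_fd)  # both pollers report the
seen_by_b = has_event(epoll_b, read_fd)  # same descriptor
deferred_second = migrate_deferred(epoll_a, epoll_b, read_fd, pending)

# The late stops finally run and also remove the registration on B
# that the connection now depends on.
for ep, fd in pending:
    stop(ep, fd)
lost_after_cleanup = not has_event(epoll_b, read_fd)

# Same round trip with stop and start done together.
start(epoll_b, read_fd)
together_results = [
    migrate_together(epoll_b, epoll_a, read_fd),
    migrate_together(epoll_a, epoll_b, read_fd),
]
still_seen_by_b = has_event(epoll_b, read_fd)

for ep in (epoll_a, epoll_b):
    ep.close()
os.close(read_fd)
os.close(write_fd)
```

In this sketch the second deferred migration fails with EEXIST, both pollers report the descriptor while the stale registration exists, and the late cleanup leaves the descriptor registered nowhere. With the stop and start done together, both migrations succeed and the pending byte is still reported on B. The sketch only illustrates the ordering; it does not exercise real cross-thread epoll_ctl calls.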