[
https://issues.apache.org/jira/browse/TS-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468179#comment-13468179
]
Alan M. Carroll commented on TS-1487:
-------------------------------------
TS-1487
Fix proposal:
1) Add a new initialization function init_HttpProxyServerSockets which would
open all of the sockets without starting threads or listening on the sockets.
This can then be called to provide a window between opening the sockets and
listening on them for plugin initialization.
2) Add a new eventing mechanism for plugins to catch specific ATS level events.
A plugin would make an API call to register a callback continuation which would
be invoked for the following events:
PORTS_OPEN : sockets for listen ports are open.
CACHE_RUNNING : cache is now operational
It is suggested that we may want to expand this to include SHUTDOWN and
RECONFIGURE. An alternative would be to have a potentially different callback
per event in the style of TSHttpHookAdd, e.g. TSAtsHookAdd(TSAtsHookID, TSCont)
(or perhaps "TSSystemHookAdd"?).
3) Plugins would then be initialized as early as possible, which means calling
the TSPluginInit function as early as possible. Plugins that need to perform
operations at some later point in the ATS lifecycle (e.g., after sockets are
opened) would set a hook during TSPluginInit and perform the operation in the
callback. It should be noted that for sockets we cannot guarantee calling the
plugin before the sockets are open, as that may happen even before the
traffic_server process is started. We can only promise that when the
PORTS_OPEN callback is invoked, the sockets are open.
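To make (2) and (3) concrete, here is a minimal, self-contained sketch of the
registration and dispatch pattern. It is not ATS code: every name in it
(LifecycleEvent, lifecycle_hook_add, lifecycle_hook_fire, example_plugin_init)
is hypothetical, and a plain function pointer stands in for a TSCont. It only
illustrates that a plugin registers early and does its work later, when the
core reaches the lifecycle point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event IDs, named after the events listed in the proposal. */
typedef enum {
  LIFECYCLE_PORTS_OPEN,    /* sockets for listen ports are open */
  LIFECYCLE_CACHE_RUNNING, /* cache is now operational */
  LIFECYCLE_EVENT_COUNT
} LifecycleEvent;

/* Stand-in for a continuation: a function pointer plus user data. */
typedef void (*LifecycleCallback)(LifecycleEvent event, void *data);

typedef struct {
  LifecycleCallback fn;
  void *data;
} LifecycleHook;

#define MAX_HOOKS_PER_EVENT 8

static LifecycleHook hooks[LIFECYCLE_EVENT_COUNT][MAX_HOOKS_PER_EVENT];
static size_t hook_count[LIFECYCLE_EVENT_COUNT];

/* Register a callback for one event. Returns 0 on success, -1 on failure. */
int lifecycle_hook_add(LifecycleEvent event, LifecycleCallback fn, void *data)
{
  assert(fn != NULL);
  if ((unsigned)event >= LIFECYCLE_EVENT_COUNT)
    return -1;
  if (hook_count[event] == MAX_HOOKS_PER_EVENT)
    return -1;
  hooks[event][hook_count[event]].fn = fn;
  hooks[event][hook_count[event]].data = data;
  hook_count[event]++;
  return 0;
}

/* Called by the core when it reaches a lifecycle point.
   Returns the number of callbacks invoked. */
size_t lifecycle_hook_fire(LifecycleEvent event)
{
  if ((unsigned)event >= LIFECYCLE_EVENT_COUNT)
    return 0;
  for (size_t i = 0; i < hook_count[event]; ++i)
    hooks[event][i].fn(event, hooks[event][i].data);
  return hook_count[event];
}

/* Hypothetical plugin: records that the ports are open. */
static void on_ports_open(LifecycleEvent event, void *data)
{
  (void)event;
  *(int *)data = 1;
}

/* Plays the role of TSPluginInit: runs early and only sets a hook. */
void example_plugin_init(int *ports_open_flag)
{
  lifecycle_hook_add(LIFECYCLE_PORTS_OPEN, on_ports_open, ports_open_flag);
}
```

In this sketch the core would call lifecycle_hook_fire(LIFECYCLE_PORTS_OPEN)
once the sockets are open. Adding SHUTDOWN or RECONFIGURE later would only mean
adding enum values, so existing plugins would keep working unchanged.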
This provides a very general mechanism which should be relatively
straightforward to use and implement and avoids a configuration variable
(always a feature!). If we find in the future additional lifecycle points at
which a plugin needs to perform operations these can be added in a fully
backwards compatible manner. The SPDY plugin would need to be updated but that
is AFAIK the only plugin that currently is dependent on this ordering. This
would represent a change from 3.2 but a reversion to the 3.0.X behavior with
regard to when plugins are initialized, which I think is acceptable.
We would still need an additional configuration variable to control the
ordering of listen thread startup and cache readiness. With the socket opening
split off as per (1) this would be relatively easy to implement. The primary
question would be whether main() should just call start_HttpProxyServer and
pass an event code, or check the event code itself and conditionally call.
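As a rough illustration of the split in (1), the sketch below separates
"open" from "listen" using plain POSIX socket calls. The function names
open_listen_socket and start_listening are hypothetical and are not the
proposed ATS functions; binding to the loopback address is also just an
assumption to keep the example harmless to run. The point is only that the
listen step can be deferred until whichever lifecycle point the configuration
selects.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Phase 1: create and bind the socket, but do not listen on it yet.
   Returns the file descriptor, or -1 on error. */
int open_listen_socket(unsigned short port)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return -1;

  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
    close(fd);
    return -1;
  }

  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* assumption: loopback only */
  addr.sin_port = htons(port);                   /* 0 lets the kernel choose */

  if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}

/* Phase 2: start accepting connections on an already opened socket.
   Returns 0 on success, -1 on error (same as listen()). */
int start_listening(int fd, int backlog)
{
  return listen(fd, backlog);
}
```

Between the two calls the descriptor exists but the socket does not accept
connections, which is the window in which plugin initialization (or waiting
for the cache) could take place.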
> the ordering of plugin_init and init_HttpProxyServer cause crashed TS to core
> endlessly
> ---------------------------------------------------------------------------------------
>
> Key: TS-1487
> URL: https://issues.apache.org/jira/browse/TS-1487
> Project: Traffic Server
> Issue Type: Bug
> Components: Core
> Affects Versions: 3.2.0
> Environment: Linux RHEL6.2
> Reporter: Aidan McGurn
> Assignee: Alan M. Carroll
> Priority: Critical
> Attachments: INTD-529-RespawnCrash.patch, INTD-529-RespawnCrash.patch
>
>
> We've had a serious issue whereby the TS when it crashes re-spawns/cores
> continuously when it tries to re-start under load. I traced the issue to
> SNMP research library (a third party lib)- They use selects and what happens
> is the file descriptor number spikes under load after the crash as all the
> sockets get opened at once - this causes buffer overflow in the select (which
> their library is full of) as the fd allocated to the FD_SET is much bigger
> than the FD_SETSIZE of 1024 (which was a bitch to track down as the stack
> was corrupted and gdb therefore useless). Tracing why this happened on 3.2.0
> and not 3.0.2, I find the sequence
> of the plugin_init has changed - On 3.0.2 the sequence was in effect 1.
> plugin_init and then 2. init_HttpProxyServer. Whereas this has mysteriously
> been reversed on 3.2.0. In order to get our system to work in this crash case
> , I've patched ATS to flip them around like in 3.0.2.
> I'll attach the patch we propose to use to get around this.
> Is this actually a bug then waiting to happen in other systems - Or was there
> a reason to change this sequence?