[
https://issues.apache.org/jira/browse/TS-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464820#comment-13464820
]
Leif Hedstrom commented on TS-1487:
-----------------------------------
I think the concept is reasonable. It seems to affect only startup and nothing
after that, right?
A few comments on the patch:
1) We probably shouldn't use pthread APIs / types directly. We have
abstractions for mutexes in ts/ts.h, and we should add condition variable /
semaphore abstractions as necessary to support this (and in general).
2) Maybe I'm missing it, but where is isProxyAllowedToStart initialized? Also,
is the semaphore really necessary? Could we not, for example, just
pthread_cond_wait() here and wait for a plugin to pthread_cond_signal() it to
move along? Then, at the end of initializing all plugins, we would
pthread_cond_broadcast() it to unblock any thread still blocked on it in case
no plugin let it move along. Maybe this breaks backward compatibility (it's the
inverse of what we do today), but I'm sure we can figure something out via e.g.
a records.config option :).
3) Why use a timed condition wait at all? Why not let it block forever? You have
no way out of it anyway, right?
4) Instead of exposing the mutex and condition variable directly, let's abstract
this into one API call, e.g. TSAllowNetAccept() or some such.
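
To make points 2 to 4 concrete, here is a minimal sketch of such a gate: one
untimed wait, one call that opens it. All names (accept_gate, gate_wait,
gate_open, gate_is_open) are hypothetical and not part of Traffic Server;
TSAllowNetAccept() above is only a suggested name. Raw pthreads are used purely
for illustration; per point 1, a real patch would presumably go through our own
abstractions instead.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical startup gate; none of these names exist in Traffic Server. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t cond;
  bool open; /* predicate, so spurious wakeups and early opens are harmless */
} accept_gate;

#define ACCEPT_GATE_INIT \
  { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/* Accept side: block with no timeout until the gate has been opened.
   Returns immediately if the gate was opened before the call. */
void gate_wait(accept_gate *g)
{
  pthread_mutex_lock(&g->lock);
  while (!g->open) {
    pthread_cond_wait(&g->cond, &g->lock);
  }
  pthread_mutex_unlock(&g->lock);
}

/* Opening side: called by a plugin when it is ready, and again by the core
   once all plugins are initialized. Calling it more than once is harmless. */
void gate_open(accept_gate *g)
{
  pthread_mutex_lock(&g->lock);
  g->open = true;
  pthread_cond_broadcast(&g->cond); /* wake every waiter, not just one */
  pthread_mutex_unlock(&g->lock);
}

/* Read the gate state under the lock. */
bool gate_is_open(accept_gate *g)
{
  pthread_mutex_lock(&g->lock);
  bool is_open = g->open;
  pthread_mutex_unlock(&g->lock);
  return is_open;
}
```

The boolean predicate is what lets the final call after plugin initialization
act as the fallback: if no plugin opened the gate, the core does, and a thread
that reaches gate_wait() afterwards does not block at all.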
Cheers,
> the ordering of plugin_init and init_HttpProxyServer cause crashed TS to core
> endlessly
> ---------------------------------------------------------------------------------------
>
> Key: TS-1487
> URL: https://issues.apache.org/jira/browse/TS-1487
> Project: Traffic Server
> Issue Type: Bug
> Components: Core
> Affects Versions: 3.2.0
> Environment: Linux RHEL6.2
> Reporter: Aidan McGurn
> Priority: Critical
> Attachments: INTD-529-RespawnCrash.patch, INTD-529-RespawnCrash.patch
>
>
> We've had a serious issue whereby TS, when it crashes, re-spawns/cores
> continuously when it tries to restart under load. I traced the issue to the
> SNMP research library (a third-party lib). They use selects and what happens
> is the file descriptor number spikes under load after the crash as all the
> sockets get opened at once - this causes buffer overflow in the select (which
> their library is full of) as the fd allocated to the FD_SET is much bigger
> than the FD_SETSIZE of 1024 (which was a bitch to track down as the stack
> was corrupted and gdb therefore useless). Tracing why this happened on 3.2.0
> and not 3.0.2, I find the sequence
> of the plugin_init has changed - On 3.0.2 the sequence was in effect 1.
> plugin_init and then 2. init_HttpProxyServer. Whereas this has mysteriously
> been reversed on 3.2.0. To get our system to work in this crash case,
> I've patched ATS to flip them around as in 3.0.2.
> I'll attach the patch we propose to use to get around this.
> Is this actually a bug waiting to happen in other systems, or was there
> a reason to change this sequence?
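
For reference, the overflow described in the report comes from calling FD_SET()
with a descriptor that is not below FD_SETSIZE, which POSIX leaves undefined and
which in practice writes past the end of the fd_set. A minimal sketch of the
guard that select()-based code needs is below; safe_fd_set is a hypothetical
helper name, not something from the SNMP library or from Traffic Server.

```c
#include <stdbool.h>
#include <sys/select.h>

/* Hypothetical helper: add fd to the set only if it fits.
   FD_SET() on fd < 0 or fd >= FD_SETSIZE is undefined behaviour and
   typically corrupts whatever sits next to the fd_set in memory. */
bool safe_fd_set(int fd, fd_set *set)
{
  if (fd < 0 || fd >= FD_SETSIZE) {
    return false; /* caller must fall back, e.g. to poll(), or close the fd */
  }
  FD_SET(fd, set);
  return true;
}
```

A library that cannot be changed has no such guard, which is why the startup
ordering matters: descriptors it opens before the proxy starts accepting get
low numbers, while descriptors opened afterwards under load may not.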