[ https://issues.apache.org/jira/browse/TS-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180000#comment-15180000 ]
ASF GitHub Bot commented on TS-4161:
------------------------------------
Github user gtenev commented on the pull request:
https://github.com/apache/trafficserver/pull/496#issuecomment-192317642
Sure, I will remove the `malloc` failure check. We could raise the limit
much higher or remove it altogether.
With the previous code, in the badly misconfigured corner case described in
TS-4161, the traffic_manager in my environment pumped 6628 messages non-stop
(1986000 bytes total) over 1.85 s until it overflowed the stack and crashed.
It kept enqueuing messages but never got a chance to process them. Since the
current implementation has no flow control, it seemed like a good idea to put
a limit in place to give the processing a chance to catch up (if possible),
or at least to pause the ever-growing allocation.
But this was a misconfigured corner case; I wonder how the "heaviest normal"
usage could be measured to set a reasonable limit, or whether I should remove
the limit altogether (open to suggestions).
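
For reference, here is a minimal sketch of the kind of cap I have in mind (the
constant, the helpers, and the limit value are placeholders for illustration,
not the actual patch):

```cpp
// Placeholder sketch of a per-poll cap; names and the limit value are made up.
bool message_available(int fd);            // assumed helper
void read_and_enqueue_one_message(int fd); // assumed helper

static const int MAX_MSGS_IN_A_ROW = 10000;

void poll_once(int fd)
{
  int processed = 0;
  // Stop draining after MAX_MSGS_IN_A_ROW messages so processing can catch up;
  // whatever is left on the socket is picked up on the next poll cycle.
  while (processed < MAX_MSGS_IN_A_ROW && message_available(fd)) {
    read_and_enqueue_one_message(fd);
    ++processed;
  }
}
```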
> ProcessManager prone to stack-overflow
> --------------------------------------
>
> Key: TS-4161
> URL: https://issues.apache.org/jira/browse/TS-4161
> Project: Traffic Server
> Issue Type: Bug
> Components: Manager
> Reporter: Gancho Tenev
> Assignee: Gancho Tenev
> Priority: Blocker
> Labels: crash
> Fix For: 6.2.0
>
>
> ProcessManager::pollLMConnection() can get "stuck" in a loop while handling
> a large number of messages in a row from the same socket.
> Since alloca() is used to allocate a buffer on the stack for each message read
> from the socket, and those buffers are not released until the function
> returns, getting "stuck" in the loop can lead to a stack overflow; the same
> could happen if a single message is large enough (accidentally or on purpose).
> It can be reproduced easily by setting up:
> proxy.config.lm.pserver_timeout_secs: 0
> proxy.config.lm.pserver_timeout_msecs: 0
> in records.config and running ./bin/traffic_manager.
> ATS crashes with a segfault in a weird place (while trying to allocate with
> malloc()). If you inspect the core, you can see that it got "stuck" in the
> loop before it crashed, overflowing the stack (it kept allocating buffers on
> the stack with alloca() until it crashed).
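>
> A simplified sketch of the pattern described above (illustrative only, not the
> actual ProcessManager code; the helpers are made up): every iteration calls
> alloca(), and alloca() memory is only reclaimed when the function returns, so
> the stack keeps growing for as long as messages keep arriving.
>
>     #include <alloca.h>
>     #include <stddef.h>
>
>     bool message_available(int fd);                    // assumed helper
>     size_t peek_message_length(int fd);                // assumed helper
>     void read_message(int fd, char *buf, size_t len);  // assumed helper
>
>     void drain_socket(int fd)
>     {
>       while (message_available(fd)) {        // can stay true for thousands of messages
>         size_t len = peek_message_length(fd);
>         char *buf = (char *)alloca(len);     // new stack allocation every iteration
>         read_message(fd, buf, len);          // buffers pile up until the stack overflows
>       }
>     }
>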
> It is worth considering replacing the alloca() with a VLA (which "releases"
> the memory when it goes out of scope on each iteration of the loop) or with
> ats_malloc(), which is supposedly less time-efficient but would handle bigger
> messages without the risk of a stack overflow.
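>
> A sketch of the heap-based alternative (again illustrative, with the same
> made-up helpers): each buffer lives for exactly one loop iteration, so memory
> use stays flat no matter how many messages arrive in a row or how large a
> single message is.
>
>     while (message_available(fd)) {
>       size_t len = peek_message_length(fd);
>       char *buf = (char *)ats_malloc(len);   // or plain malloc() outside ATS
>       read_message(fd, buf, len);
>       handle_message(buf, len);              // assumed processing hook
>       ats_free(buf);                         // released before the next message is read
>     }
>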
> IMO adding a message size limit check is good practice, especially with the
> current implementation.
> If the code gets "stuck" in the while loop while reading a large number of
> messages in a row from the same socket, then the port configured by
> proxy.config.process_manager.mgmt_port becomes unavailable (connection
> refused). Adding a limit on the number of messages that can be processed in a
> row seems like a good idea.
> I stumbled upon this while running TSQA regression tests, where TSQA kept
> complaining that the management port was not available and ATS kept
> crashing.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)