Gancho Tenev created TS-4161:
--------------------------------
Summary: ProcessManager prone to stack-overflow
Key: TS-4161
URL: https://issues.apache.org/jira/browse/TS-4161
Project: Traffic Server
Issue Type: Bug
Components: Manager
Reporter: Gancho Tenev
ProcessManager::pollLMConnection() can get "stuck" in a loop while handling a
large number of messages in a row from the same socket.
Since alloca() is used to allocate a buffer on the stack for each message read
from the socket, and those buffers are not released until the function returns,
getting "stuck" in the loop can lead to a stack overflow. The same could happen
with a single message whose length is big enough (accidentally or on purpose).
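To illustrate the failure mode, here is a minimal, self-contained sketch (the
names are hypothetical, this is not the actual pollLMConnection() code) of why
alloca() inside the read loop keeps growing the stack frame:

#include <alloca.h>
#include <cstddef>
#include <cstring>
#include <cstdio>

static void handle_message(const char *buf, size_t len) {
  (void)buf;
  (void)len; // stand-in for the real message handling
}

static void poll_connection_sketch(int msg_count, size_t msg_len) {
  for (int i = 0; i < msg_count; ++i) {
    // Each iteration carves a fresh buffer out of this function's stack frame;
    // none of these buffers are reclaimed until poll_connection_sketch()
    // itself returns, so the frame keeps growing with every message.
    char *buf = static_cast<char *>(alloca(msg_len));
    memset(buf, 0, msg_len);
    handle_message(buf, msg_len);
  }
}

int main() {
  // Harmless with small values; with a large enough msg_count (or a single
  // huge msg_len) the cumulative allocations exceed the stack limit and the
  // process segfaults, which is what happens to traffic_manager.
  poll_connection_sketch(8, 4096);
  printf("done\n");
  return 0;
}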
It can be reproduced easily by setting:
proxy.config.lm.pserver_timeout_secs: 0
proxy.config.lm.pserver_timeout_msecs: 0
in records.config and running ./bin/traffic_manager.
ATS crashes with a segfault in an unexpected place (while trying to allocate
with malloc()). Inspecting the core shows that it got "stuck" in the loop
before it crashed, overflowing the stack (it kept allocating buffers on the
stack with alloca() until it crashed).
It is worth considering replacing the alloca() with a VLA (which "releases"
its memory when it goes out of scope on each iteration of the loop) or with
ats_malloc(), which is presumably less time-efficient but would handle bigger
messages without risking a stack overflow.
IMO adding a message size limit check is good practice, especially with the
current implementation.
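A minimal sketch of that direction, assuming a heap allocation per iteration
plus a size cap (the limit, the helper names, and the use of plain
malloc()/free() instead of ats_malloc()/ats_free() are illustrative only):

#include <cstddef>
#include <cstdlib>
#include <cstdio>

static const size_t MAX_MGMT_MSG_LEN = 1 * 1024 * 1024; // hypothetical cap

static bool handle_message(const char *buf, size_t len) {
  (void)buf;
  (void)len; // stand-in for the real message handling
  return true;
}

static bool read_and_handle_one(size_t msg_len) {
  if (msg_len > MAX_MGMT_MSG_LEN) {
    fprintf(stderr, "dropping oversized management message (%zu bytes)\n",
            msg_len);
    return false;
  }
  // Heap allocation scoped to this iteration; in ATS this would be
  // ats_malloc()/ats_free() rather than plain malloc()/free().
  char *buf = static_cast<char *>(malloc(msg_len));
  if (buf == NULL) {
    return false;
  }
  bool ok = handle_message(buf, msg_len);
  free(buf);
  return ok;
}

int main() {
  read_and_handle_one(4096);
  return 0;
}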
If the code gets "stuck" in the while loop while reading a large number of
messages in a row from the same socket, then the port configured by
proxy.config.process_manager.mgmt_port becomes unavailable (connection
refused). Adding a limit on the number of messages that can be processed in a
row seems like a good idea.
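A sketch of what capping the number of messages handled per poll pass could
look like (the limit and all names are hypothetical):

#include <cstdio>

static const int MAX_MSGS_PER_POLL = 10000; // hypothetical per-pass limit

// Stand-ins for "is another message ready?" and "read and handle one message".
static bool message_ready() { return true; }
static void process_one_message() {}

static void poll_connection_bounded() {
  for (int handled = 0; handled < MAX_MSGS_PER_POLL && message_ready();
       ++handled) {
    process_one_message();
  }
  // Returning here gives the rest of traffic_manager a chance to run (e.g.
  // keep accepting connections on proxy.config.process_manager.mgmt_port)
  // even if the peer keeps sending messages back-to-back.
}

int main() {
  poll_connection_bounded();
  printf("processed at most %d messages in this pass\n", MAX_MSGS_PER_POLL);
  return 0;
}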
I stumbled upon this while running TSQA regression tests, where TSQA kept
complaining that the management port was not available and ATS kept crashing.