[ https://issues.apache.org/jira/browse/TS-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gancho Tenev updated TS-4161:
-----------------------------
    Description: 
ProcessManager::pollLMConnection() can get "stuck" in a loop while handling a 
large number of messages in a row from the same socket. 

Since alloca() is used to allocate a buffer on the stack for each message read 
from the socket, and those buffers are not released until the function returns, 
getting "stuck" in the loop can lead to a stack overflow. The same could happen 
with a single message whose declared length is big enough (accidentally or on 
purpose).
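
For illustration, here is a minimal, self-contained sketch of the failure mode 
(not the actual pollLMConnection() code, just the same allocation pattern): 
every pass through the loop calls alloca(), and none of that stack space is 
reclaimed until the function returns.

  #include <alloca.h>
  #include <stdio.h>
  #include <string.h>

  /* ITERATIONS stands in for messages handled back-to-back, MSG_LEN for the
   * per-message length field; both are made-up values for the demo. */
  #define ITERATIONS 1000
  #define MSG_LEN    4096

  int
  main(void)
  {
    for (int i = 0; i < ITERATIONS; ++i) {
      char *buf = (char *)alloca(MSG_LEN); /* same pattern as the read loop */
      memset(buf, 'x', MSG_LEN);           /* "process" the message */
    }
    /* All ITERATIONS buffers are still live on main()'s frame here; bump the
     * constants and the program dies with SIGSEGV from stack exhaustion. */
    printf("still holding ~%d bytes of stack after the loop\n",
           ITERATIONS * MSG_LEN);
    return 0;
  }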

It can be reproduced easily by setting up:
            proxy.config.lm.pserver_timeout_secs: 0
            proxy.config.lm.pserver_timeout_msecs: 0
in records.config and running ./bin/traffic_manager. 
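
For completeness, the two entries in records.config would look roughly like 
this (assuming the usual CONFIG/INT record syntax):

            CONFIG proxy.config.lm.pserver_timeout_secs INT 0
            CONFIG proxy.config.lm.pserver_timeout_msecs INT 0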

ATS crashes with a segfault in a seemingly unrelated place (while trying to 
allocate with malloc()). Inspecting the core shows that it got "stuck" in the 
loop and kept allocating buffers on the stack with alloca() until it overflowed 
the stack and crashed.

It is worth considering replacing the alloca() with a VLA (whose memory is 
"released" when it goes out of scope on each iteration of the loop) or with 
ats_malloc(), which is somewhat less time-efficient but can handle bigger 
messages without any risk of stack overflow. 
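
A rough sketch of the heap-based variant, with the buffer freed on every 
iteration so stack usage stays constant (the loop body and names are 
simplified stand-ins, not the real pollLMConnection() code; ats_malloc() and 
ats_free() would play the malloc()/free() role in ATS):

  #include <stdlib.h>
  #include <unistd.h>

  /* Simplified stand-in for the read loop: one heap buffer per message,
   * released at the end of each iteration, unlike alloca(). */
  static void
  handle_messages(int fd, size_t msg_len, int msg_count)
  {
    for (int i = 0; i < msg_count; ++i) {
      char *buf = (char *)malloc(msg_len);   /* or ats_malloc(msg_len) */
      if (buf == NULL) {
        break;                               /* give up instead of crashing */
      }
      if (read(fd, buf, msg_len) > 0) {
        /* ... dispatch the message ... */
      }
      free(buf);                             /* or ats_free(buf) */
    }
  }

A VLA (char buf[msg_len]; inside the loop body) would also be released each 
iteration, but it still lives on the stack, so a single huge length field 
could overflow it; the heap variant avoids that.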

IMO adding a message size limit check is good practice, especially with the 
current implementation (a sketch of both suggested checks follows the next 
paragraph).

If the code gets "stuck" in the while loop while reading a large number of 
messages in a row from the same socket, then the port configured by 
proxy.config.process_manager.mgmt_port becomes unavailable (connection 
refused). Adding a limit on the number of messages that can be processed in a 
row would be a good idea.
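
Both guards amount to a couple of comparisons in the read loop; a hedged 
sketch covering the size-limit check from the previous paragraph as well as 
the cap on consecutive messages (the constants and names are illustrative, 
not existing ATS records or code):

  #include <stddef.h>

  /* Illustrative limits only -- not existing ATS configuration. */
  enum {
    MAX_MSG_LEN       = 1024 * 1024, /* refuse absurd length fields */
    MAX_MSGS_PER_POLL = 1000         /* then return and pick the rest up later */
  };

  /* Returns 0 when the loop should stop instead of handling another message. */
  static int
  message_allowed(size_t msg_len, int msgs_handled_this_poll)
  {
    if (msg_len > MAX_MSG_LEN) {
      return 0; /* fail the connection instead of allocating msg_len bytes */
    }
    if (msgs_handled_this_poll >= MAX_MSGS_PER_POLL) {
      return 0; /* yield so other sockets, e.g. the
                   proxy.config.process_manager.mgmt_port listener, get served */
    }
    return 1;
  }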

I stumbled upon this while running TSQA regression tests, where TSQA kept 
complaining that the management port was not available and ATS kept crashing.


> ProcessManager prone to stack-overflow
> --------------------------------------
>
>                 Key: TS-4161
>                 URL: https://issues.apache.org/jira/browse/TS-4161
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Manager
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>              Labels: crash
>             Fix For: 6.2.0
>



