[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316783#comment-14316783
 ] 

Luca Bruno commented on TS-3386:
--------------------------------

I'm using latest 5.2.0. Just to say, I've read some of the code of the cop and 
it's quite clear: if it fails 2 times, it's killed. Didn't put attention on 
whether it's using a separate connection though.

> Heartbeat failed with high load, trafficserver restarted
> --------------------------------------------------------
>
>                 Key: TS-3386
>                 URL: https://issues.apache.org/jira/browse/TS-3386
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Performance
>            Reporter: Luca Bruno
>
> I've been evaluating ATS for some days. I'm using it with mostly default 
> settings, except I've lowered the number of connections to the backend, I 
> have a raw storage of 500gb, and disabled ram cache.
> Working fine, then I wanted to stress it more. I've increased the test to 
> 1000 concurrent requests, then the ATS worker has been restarted and thus 
> lost the whole cache.
> /var/log/syslog:
> {noformat}
> Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
> [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
> [LocalManager::sendMgmtMsgToProcesses] Error writing message
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
> (last system error 32: Broken pipe)
> Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
> signal [32985 256]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
> making sure traffic_server is dead
> Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
> Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
> ---
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
> Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
> 2015 at 13:05:19)
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:52 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:02 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:07:02 test-cache traffic_cop[32984]: killing server
> Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:07:12 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:22 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:07:22 test-cache traffic_cop[32984]: killing server
> Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:07:32 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:43 test-cache traffic_cop[32984]: server heartbeat succeeded
> {noformat}
> manager.log:
> {noformat}
> [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:06:03.603] Manager {0x7f975c537720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '11'
> [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} FATAL: 
> [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: 
> [LocalManager::mgmtShutdown] Executing shutdown request.
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: 
> [LocalManager::processShutdown] Executing process shutdown request.
> [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR: 
> [LocalManager::sendMgmtMsgToProcesses] Error writing message
> [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR:  (last system error 32: 
> Broken pipe)
> [Feb 11 10:06:22.764] {0x7f2c94ded720} STATUS: opened 
> /usr/local/var/log/trafficserver/manager.log
> [Feb 11 10:06:22.764] {0x7f2c94ded720} NOTE: updated diags config
> [Feb 11 10:06:22.768] Manager {0x7f2c94ded720} NOTE: [ClusterCom::ClusterCom] 
> Node running on OS: 'Linux' Release: '3.2.0-4-amd64'
> [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::listenForProxy] Listening on port: 8080
> [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: [TrafficManager] Setup 
> complete
> [Feb 11 10:06:23.827] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:06:43.813] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:07:03.914] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:07:22.988] Manager {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:07:22.989] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:07:23.997] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> {noformat}
> Note the "(http test) received non-200 status(502)" in the syslog, so it's 
> like the process didn't respond gracefully to the heartbeat. 
> Not sure if it was a problem of the backend, or of the trafficserver worker. 
> Either ways, how do I make trafficserver more robust against this condition? 
> I never want to lose the cache in such cases where trafficserver is under 
> high load. What could be the options in the config to tweak to achieve more 
> stability?
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to