[
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316143#comment-14316143
]
Luca Bruno edited comment on TS-3386 at 2/11/15 12:31 PM:
----------------------------------------------------------
So it's like the server saturated the connections, and it replied 502 even to
the heartbeat server, without analyzing the response. Am I right? In this case,
I'd either like to disable heartbeat completely or start a second tcp server
just for heartbeat, instead of mixing heartbeat + client connections within the
same server.
was (Author: lethalman):
So it's like the server saturated the connections, and it replied 502 even to
the heartbeat server, without analyzing the response. Am I right? In this case,
I'd either like to disable heartbeat completely or start a second tcp server
just for heartbeat, isntead of mixing heartbeat + client connections within the
same server.
> Heartbeat failed with high load, trafficserver restarted
> --------------------------------------------------------
>
> Key: TS-3386
> URL: https://issues.apache.org/jira/browse/TS-3386
> Project: Traffic Server
> Issue Type: Bug
> Components: Performance
> Reporter: Luca Bruno
>
> I've been evaluating ATS for some days. I'm using it with mostly default
> settings, except I've lowered the number of connections to the backend, I
> have a raw storage of 500gb, and disabled ram cache.
> Working fine, then I wanted to stress it more. I've increased the test to
> 1000 concurrent requests, then the ATS worker has been restarted and thus
> lost the whole cache.
> /var/log/syslog:
> {noformat}
> Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9:
> Killed
> Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server
> Starting ---
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE:
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL:
> [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:
> [LocalManager::sendMgmtMsgToProcesses] Error writing message
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:
> (last system error 32: Broken pipe)
> Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status
> signal [32985 256]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running,
> making sure traffic_server is dead
> Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
> Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting
> ---
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version:
> Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10
> 2015 at 13:05:19)
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE:
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server
> Starting ---
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE:
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR:
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9:
> Killed
> Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR:
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server
> Starting ---
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE:
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:52 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:06:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:02 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:07:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:07:02 test-cache traffic_cop[32984]: killing server
> Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR:
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9:
> Killed
> Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR:
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: --- traffic_server
> Starting ---
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: traffic_server
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on
> Feb 10 2015 at 13:04:42)
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE:
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:07:12 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:07:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:22 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:07:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:07:22 test-cache traffic_cop[32984]: killing server
> Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR:
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9:
> Killed
> Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR:
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: --- traffic_server
> Starting ---
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: traffic_server
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on
> Feb 10 2015 at 13:04:42)
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE:
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:07:32 test-cache traffic_cop[32984]: (http test) received non-200
> status(502)
> Feb 11 10:07:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:43 test-cache traffic_cop[32984]: server heartbeat succeeded
> {noformat}
> manager.log:
> {noformat}
> [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR:
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9:
> Killed
> [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: [Alarms::signalAlarm]
> Server Process was reset
> [Feb 11 10:06:03.603] Manager {0x7f975c537720} NOTE:
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE:
> [LocalManager::pollMgmtProcessServer] New process connecting fd '11'
> [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: [Alarms::signalAlarm]
> Server Process born
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} FATAL:
> [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE:
> [LocalManager::mgmtShutdown] Executing shutdown request.
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE:
> [LocalManager::processShutdown] Executing process shutdown request.
> [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR:
> [LocalManager::sendMgmtMsgToProcesses] Error writing message
> [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR: (last system error 32:
> Broken pipe)
> [Feb 11 10:06:22.764] {0x7f2c94ded720} STATUS: opened
> /usr/local/var/log/trafficserver/manager.log
> [Feb 11 10:06:22.764] {0x7f2c94ded720} NOTE: updated diags config
> [Feb 11 10:06:22.768] Manager {0x7f2c94ded720} NOTE: [ClusterCom::ClusterCom]
> Node running on OS: 'Linux' Release: '3.2.0-4-amd64'
> [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE:
> [LocalManager::listenForProxy] Listening on port: 8080
> [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: [TrafficManager] Setup
> complete
> [Feb 11 10:06:23.827] Manager {0x7f2c94ded720} NOTE:
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE:
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm]
> Server Process born
> [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR:
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9:
> Killed
> [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm]
> Server Process was reset
> [Feb 11 10:06:43.813] Manager {0x7f2c94ded720} NOTE:
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE:
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm]
> Server Process born
> [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR:
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9:
> Killed
> [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm]
> Server Process was reset
> [Feb 11 10:07:03.914] Manager {0x7f2c94ded720} NOTE:
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE:
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm]
> Server Process born
> [Feb 11 10:07:22.988] Manager {0x7f2c94ded720} ERROR:
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9:
> Killed
> [Feb 11 10:07:22.989] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm]
> Server Process was reset
> [Feb 11 10:07:23.997] Manager {0x7f2c94ded720} NOTE:
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE:
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm]
> Server Process born
> {noformat}
> Note the "(http test) received non-200 status(502)" in the syslog, so it's
> like the process didn't respond gracefully to the heartbeat.
> Not sure if it was a problem of the backend, or of the trafficserver worker.
> Either ways, how do I make trafficserver more robust against this condition?
> I never want to lose the cache in such cases where trafficserver is under
> high load. What could be the options in the config to tweak to achieve more
> stability?
> Thanks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)