[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316671#comment-14316671 ]
Leif Hedstrom commented on TS-3386: ----------------------------------- I'm pretty sure there was a Jira on exactly that suggestion [~jpe...@apache.org] > Heartbeat failed with high load, trafficserver restarted > -------------------------------------------------------- > > Key: TS-3386 > URL: https://issues.apache.org/jira/browse/TS-3386 > Project: Traffic Server > Issue Type: Bug > Components: Performance > Reporter: Luca Bruno > > I've been evaluating ATS for some days. I'm using it with mostly default > settings, except I've lowered the number of connections to the backend, I > have a raw storage of 500gb, and disabled ram cache. > Working fine, then I wanted to stress it more. I've increased the test to > 1000 concurrent requests, then the ATS worker has been restarted and thus > lost the whole cache. > /var/log/syslog: > {noformat} > Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1] > Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2] > Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server > Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: > [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: > Killed > Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: > [Alarms::signalAlarm] Server Process was reset > Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server > Starting --- > Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server > Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on > Feb 10 2015 at 13:04:42) > Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: > RLIMIT_NOFILE(7):cur(736236),max(736236) > Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1] > Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2] > Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server > Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: > [LocalManager::pollMgmtProcessServer] Error in read (errno: 104) > Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: > [LocalManager::sendMgmtMsgToProcesses] Error writing message > Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: > (last system error 32: Broken pipe) > Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status > signal [32985 256] > Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, > making sure traffic_server is dead > Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager > Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin > Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting > --- > Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: > Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 > 2015 at 13:05:19) > Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: > RLIMIT_NOFILE(7):cur(736236),max(736236) > Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server > Starting --- > Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server > Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on > Feb 10 2015 at 13:04:42) > Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: > RLIMIT_NOFILE(7):cur(736236),max(736236) > Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1] > Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2] > Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server > Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: > [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: > Killed > Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: > [Alarms::signalAlarm] Server Process was reset > Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server > Starting --- > Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server > Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on > Feb 10 2015 at 13:04:42) > Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: > RLIMIT_NOFILE(7):cur(736236),max(736236) > Feb 11 10:06:52 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:06:52 test-cache traffic_cop[32984]: server heartbeat failed [1] > Feb 11 10:07:02 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:07:02 test-cache traffic_cop[32984]: server heartbeat failed [2] > Feb 11 10:07:02 test-cache traffic_cop[32984]: killing server > Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: > [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: > Killed > Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: > [Alarms::signalAlarm] Server Process was reset > Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: --- traffic_server > Starting --- > Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: traffic_server > Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on > Feb 10 2015 at 13:04:42) > Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: > RLIMIT_NOFILE(7):cur(736236),max(736236) > Feb 11 10:07:12 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:07:12 test-cache traffic_cop[32984]: server heartbeat failed [1] > Feb 11 10:07:22 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:07:22 test-cache traffic_cop[32984]: server heartbeat failed [2] > Feb 11 10:07:22 test-cache traffic_cop[32984]: killing server > Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: > [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: > Killed > Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: > [Alarms::signalAlarm] Server Process was reset > Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: --- traffic_server > Starting --- > Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: traffic_server > Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on > Feb 10 2015 at 13:04:42) > Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: > RLIMIT_NOFILE(7):cur(736236),max(736236) > Feb 11 10:07:32 test-cache traffic_cop[32984]: (http test) received non-200 > status(502) > Feb 11 10:07:32 test-cache traffic_cop[32984]: server heartbeat failed [1] > Feb 11 10:07:43 test-cache traffic_cop[32984]: server heartbeat succeeded > {noformat} > manager.log: > {noformat} > [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: > [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: > Killed > [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: [Alarms::signalAlarm] > Server Process was reset > [Feb 11 10:06:03.603] Manager {0x7f975c537720} NOTE: > [LocalManager::startProxy] Launching ts process > [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: > [LocalManager::pollMgmtProcessServer] New process connecting fd '11' > [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: [Alarms::signalAlarm] > Server Process born > [Feb 11 10:06:22.616] Manager {0x7f975c537720} FATAL: > [LocalManager::pollMgmtProcessServer] Error in read (errno: 104) > [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: > [LocalManager::mgmtShutdown] Executing shutdown request. > [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: > [LocalManager::processShutdown] Executing process shutdown request. > [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR: > [LocalManager::sendMgmtMsgToProcesses] Error writing message > [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR: (last system error 32: > Broken pipe) > [Feb 11 10:06:22.764] {0x7f2c94ded720} STATUS: opened > /usr/local/var/log/trafficserver/manager.log > [Feb 11 10:06:22.764] {0x7f2c94ded720} NOTE: updated diags config > [Feb 11 10:06:22.768] Manager {0x7f2c94ded720} NOTE: [ClusterCom::ClusterCom] > Node running on OS: 'Linux' Release: '3.2.0-4-amd64' > [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: > [LocalManager::listenForProxy] Listening on port: 8080 > [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: [TrafficManager] Setup > complete > [Feb 11 10:06:23.827] Manager {0x7f2c94ded720} NOTE: > [LocalManager::startProxy] Launching ts process > [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: > [LocalManager::pollMgmtProcessServer] New process connecting fd '13' > [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] > Server Process born > [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: > [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: > Killed > [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] > Server Process was reset > [Feb 11 10:06:43.813] Manager {0x7f2c94ded720} NOTE: > [LocalManager::startProxy] Launching ts process > [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: > [LocalManager::pollMgmtProcessServer] New process connecting fd '13' > [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] > Server Process born > [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: > [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: > Killed > [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] > Server Process was reset > [Feb 11 10:07:03.914] Manager {0x7f2c94ded720} NOTE: > [LocalManager::startProxy] Launching ts process > [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: > [LocalManager::pollMgmtProcessServer] New process connecting fd '13' > [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] > Server Process born > [Feb 11 10:07:22.988] Manager {0x7f2c94ded720} ERROR: > [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: > Killed > [Feb 11 10:07:22.989] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] > Server Process was reset > [Feb 11 10:07:23.997] Manager {0x7f2c94ded720} NOTE: > [LocalManager::startProxy] Launching ts process > [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: > [LocalManager::pollMgmtProcessServer] New process connecting fd '13' > [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] > Server Process born > {noformat} > Note the "(http test) received non-200 status(502)" in the syslog, so it's > like the process didn't respond gracefully to the heartbeat. > Not sure if it was a problem of the backend, or of the trafficserver worker. > Either ways, how do I make trafficserver more robust against this condition? > I never want to lose the cache in such cases where trafficserver is under > high load. What could be the options in the config to tweak to achieve more > stability? > Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)