[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316517#comment-14316517
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 5:26 PM:
---------------------------------------------------------

James:

I've pasted the manager.log above; diags.log had nothing related. And syslog 
clearly shows that the server returned a 502 to cop, and cop decided to kill 
the server after the check failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

But I haven't found anything in the logs explaining why it returns 502. Perhaps 
the connections to the origin server are exhausted and it returns Bad Gateway 
for that reason to the clients as well?

To reproduce, set the connection throttle to 100 and perform 1000 concurrent 
requests to ATS with a minimal hit ratio (so that it also opens connections to 
the origin server).
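
Concretely, the kind of setup I mean looks roughly like this (illustrative 
sketch only: the load tool, URL list and exact values are just an example, not 
copied from my test rig):
{noformat}
# records.config: cap the total number of network connections at 100
CONFIG proxy.config.net.connections_throttle INT 100

# urls.txt: mostly-unique URLs pointing at ATS (e.g. http://test-cache:8080/obj1,
# /obj2, ...) so that requests miss the cache and ATS must contact the origin.
# Then drive 1000 concurrent clients against it, for example with siege:
siege -c 1000 -f urls.txt
{noformat}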

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500 GB raw disk cache. Isn't it supposed to persist across 
restarts? Perhaps I'm missing some option (EDIT: it does re-read the cache, see 
comment below). Either way, cop would keep restarting the server, so fixing the 
cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't limit the number of connections 
like that in production, but I still consider this a bug.
q2.1: Not sure, sorry; what is httpSM?
q2.2: I think there are two options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it also does it for the heartbeat). nginx, for 
example, doesn't kill its worker processes just because it has reached the 
maximum number of connections.
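
To illustrate the second option, here is a minimal sketch in plain C (not ATS 
code; the function and the connection counters are hypothetical) of "accept and 
immediately close with an error" without ever killing the process:
{noformat}
/* Sketch only: when over the connection limit, accept() anyway, send a
 * short error response and close, instead of letting the heartbeat fail
 * and the whole server get killed. */
#include <sys/socket.h>
#include <unistd.h>

static const char busy[] =
    "HTTP/1.1 503 Service Unavailable\r\n"
    "Connection: close\r\nContent-Length: 0\r\n\r\n";

/* listen_fd is an already-listening socket; the counters are hypothetical. */
void handle_incoming(int listen_fd, int cur_conns, int max_conns)
{
    int fd = accept(listen_fd, NULL, NULL);
    if (fd < 0)
        return;
    if (cur_conns >= max_conns) {
        /* Busy: reject this one client, keep the server alive. */
        write(fd, busy, sizeof(busy) - 1);
        close(fd);
        return;
    }
    /* ... otherwise hand fd over to the normal request-handling path ... */
    close(fd); /* placeholder in this sketch */
}
{noformat}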

Sure, I'm keeping it simple, but robustness also means the server must not be 
killed just because it's busy. Being busy means it's working, so the heartbeat 
should not kill it. In my opinion this could be solved either by using a 
dedicated connection for the heartbeat, or by not killing the server at all and 
instead keeping the alerts in the log files for external monitoring.




> Heartbeat failed with high load, trafficserver restarted
> --------------------------------------------------------
>
>                 Key: TS-3386
>                 URL: https://issues.apache.org/jira/browse/TS-3386
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Performance
>            Reporter: Luca Bruno
>
> I've been evaluating ATS for some days. I'm using it with mostly default 
> settings, except that I've lowered the number of connections to the backend, 
> I have a 500 GB raw disk storage, and I've disabled the RAM cache.
> It was working fine, then I wanted to stress it more. I increased the test 
> to 1000 concurrent requests, and the ATS worker was restarted, thus losing 
> the whole cache.
> /var/log/syslog:
> {noformat}
> Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
> [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
> [LocalManager::sendMgmtMsgToProcesses] Error writing message
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
> (last system error 32: Broken pipe)
> Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
> signal [32985 256]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
> making sure traffic_server is dead
> Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
> Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
> ---
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
> Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
> 2015 at 13:05:19)
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:52 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:02 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:07:02 test-cache traffic_cop[32984]: killing server
> Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:07:12 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:22 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:07:22 test-cache traffic_cop[32984]: killing server
> Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:07:32 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:43 test-cache traffic_cop[32984]: server heartbeat succeeded
> {noformat}
> manager.log:
> {noformat}
> [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:06:03.603] Manager {0x7f975c537720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '11'
> [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} FATAL: 
> [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: 
> [LocalManager::mgmtShutdown] Executing shutdown request.
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: 
> [LocalManager::processShutdown] Executing process shutdown request.
> [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR: 
> [LocalManager::sendMgmtMsgToProcesses] Error writing message
> [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR:  (last system error 32: 
> Broken pipe)
> [Feb 11 10:06:22.764] {0x7f2c94ded720} STATUS: opened 
> /usr/local/var/log/trafficserver/manager.log
> [Feb 11 10:06:22.764] {0x7f2c94ded720} NOTE: updated diags config
> [Feb 11 10:06:22.768] Manager {0x7f2c94ded720} NOTE: [ClusterCom::ClusterCom] 
> Node running on OS: 'Linux' Release: '3.2.0-4-amd64'
> [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::listenForProxy] Listening on port: 8080
> [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: [TrafficManager] Setup 
> complete
> [Feb 11 10:06:23.827] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:06:43.813] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:07:03.914] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:07:22.988] Manager {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:07:22.989] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:07:23.997] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> {noformat}
> Note the "(http test) received non-200 status(502)" in the syslog, so it 
> looks like the process didn't respond gracefully to the heartbeat. 
> I'm not sure whether it was a problem of the backend or of the trafficserver 
> worker. Either way, how do I make trafficserver more robust against this 
> condition? I never want to lose the cache just because trafficserver is 
> under high load. Which config options could be tweaked to achieve more 
> stability?
> Thanks.


