[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316316#comment-14316316
 ] 

Luca Bruno commented on TS-3386:
--------------------------------

Ok my remap is now:
{noformat}
map http://test-cache.local:8080/ http://test-cache.local:8081/
{noformat}

Where test-cache.local is a real entry in a dns server of the local network. 
Things didn't change much.

Btw I've found that in the error.log there's something valuable (it also 
happened before), this is repeated indefinitely in the error.log:

{noformat}
20150211.15h53m56s RESPONSE: sent 192.168.199.31 status 502 (Connect Error 
<internal error - server connection terminated/-19999>) for 
'http://test-cache.service.farm:8081/storage/test29429?size=11023&age=46230&sleep=0'
{noformat}

Please note that sending those concurrent requests to the backend server work 
just fine without any error. Anyway, I believe an error in the backend should 
not result in ATS being restarted.

Debug:

{noformat}
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[ink_cluster_time] local: 1423666415, highest_delta: 0, cluster: 1423666415
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: 192.168.x.y
{noformat}

Then I noticed this:

{noformat}
[Feb 11 15:53:35.043] Server {0x2b7ddc565e00} DEBUG: (http) [293] hostdb update 
marking IP: 192.168.x.y:8081 as down
{noformat}
{noformat}

> Heartbeat failed with high load, trafficserver restarted
> --------------------------------------------------------
>
>                 Key: TS-3386
>                 URL: https://issues.apache.org/jira/browse/TS-3386
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Performance
>            Reporter: Luca Bruno
>
> I've been evaluating ATS for some days. I'm using it with mostly default 
> settings, except I've lowered the number of connections to the backend, I 
> have a raw storage of 500gb, and disabled ram cache.
> Working fine, then I wanted to stress it more. I've increased the test to 
> 1000 concurrent requests, then the ATS worker has been restarted and thus 
> lost the whole cache.
> /var/log/syslog:
> {noformat}
> Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
> [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
> [LocalManager::sendMgmtMsgToProcesses] Error writing message
> Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
> (last system error 32: Broken pipe)
> Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
> signal [32985 256]
> Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
> making sure traffic_server is dead
> Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
> Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
> ---
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
> Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
> 2015 at 13:05:19)
> Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
> Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:06:52 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:06:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:02 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:07:02 test-cache traffic_cop[32984]: killing server
> Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:07:12 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:22 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
> Feb 11 10:07:22 test-cache traffic_cop[32984]: killing server
> Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
> [Alarms::signalAlarm] Server Process was reset
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: --- traffic_server 
> Starting ---
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: traffic_server 
> Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
> Feb 10 2015 at 13:04:42)
> Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: 
> RLIMIT_NOFILE(7):cur(736236),max(736236)
> Feb 11 10:07:32 test-cache traffic_cop[32984]: (http test) received non-200 
> status(502)
> Feb 11 10:07:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
> Feb 11 10:07:43 test-cache traffic_cop[32984]: server heartbeat succeeded
> {noformat}
> manager.log:
> {noformat}
> [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:06:03.603] Manager {0x7f975c537720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '11'
> [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} FATAL: 
> [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: 
> [LocalManager::mgmtShutdown] Executing shutdown request.
> [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: 
> [LocalManager::processShutdown] Executing process shutdown request.
> [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR: 
> [LocalManager::sendMgmtMsgToProcesses] Error writing message
> [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR:  (last system error 32: 
> Broken pipe)
> [Feb 11 10:06:22.764] {0x7f2c94ded720} STATUS: opened 
> /usr/local/var/log/trafficserver/manager.log
> [Feb 11 10:06:22.764] {0x7f2c94ded720} NOTE: updated diags config
> [Feb 11 10:06:22.768] Manager {0x7f2c94ded720} NOTE: [ClusterCom::ClusterCom] 
> Node running on OS: 'Linux' Release: '3.2.0-4-amd64'
> [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::listenForProxy] Listening on port: 8080
> [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: [TrafficManager] Setup 
> complete
> [Feb 11 10:06:23.827] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:06:43.813] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:07:03.914] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> [Feb 11 10:07:22.988] Manager {0x7f2c94ded720} ERROR: 
> [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
> Killed
> [Feb 11 10:07:22.989] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] 
> Server Process was reset
> [Feb 11 10:07:23.997] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::startProxy] Launching ts process
> [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: 
> [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
> [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] 
> Server Process born
> {noformat}
> Note the "(http test) received non-200 status(502)" in the syslog, so it's 
> like the process didn't respond gracefully to the heartbeat. 
> Not sure if it was a problem of the backend, or of the trafficserver worker. 
> Either ways, how do I make trafficserver more robust against this condition? 
> I never want to lose the cache in such cases where trafficserver is under 
> high load. What could be the options in the config to tweak to achieve more 
> stability?
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to