Spidey01 commented on issue #8614:
URL: https://github.com/apache/trafficserver/issues/8614#issuecomment-1319122346

   Running ATS 9.1.3, I've observed something similar, perhaps the same problem. Whenever transactions arrive really close together, I see some portion of them logged as ERR_CLIENT_ABORT once `proxy.config.http.transaction_active_timeout_in` seconds have elapsed.
   
   Trying to simulate a thundering herd, I made a little script:
   
    ```sh
    #!/bin/sh

    url="$1"
    size="$2"
    interval="$3"
    maxtime=9999999

    echo "url: $url size: $size"

    i=0
    while [ "$i" -lt "$size" ]; do
        i=$((i + 1))
        echo "start client $i"

        case $(uname -s) in
            Darwin)
                # macOS uses BSD style script.
                script "/tmp/curl.$i.out" curl -m "$maxtime" -v "$url" -L &
                ;;
            Linux)
                # Assume util-linux style script.
                script -c "curl -m $maxtime -v '$url' -L" "/tmp/curl.$i.out" &
                ;;
        esac

        [ -n "$interval" ] && sleep "$interval"
    done
    # Be nice to the terminal.
    wait
    echo
    ```
   
   This effectively lets me fire off 'n' curls in rapid succession, with curl's max time set so high that it never times out for all practical purposes. It's enough for small-scale poking before reaching for a serious load-testing tool; I wanted something small and simple for observing the Read While Writer behavior, whether that means a few hits or a lot of them.
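   For reference, the optional third argument spaces the clients out; e.g., something like this fires 8 clients roughly 100 ms apart (fractional sleeps assume a GNU or BSD `sleep`):
   
    ```sh
    ./thundering-herd.sh "http://cache-vm/MHawk/rolling-buffer/itcrowd/index.m3u8" 8 0.1
    ```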
   
   Using a CentOS 7 VM in Parallels with ATS 9.1.3, even a small herd of requests is enough to trigger this:
   
   `./thundering-herd.sh "http://cache-vm/MHawk/rolling-buffer/itcrowd/index.m3u8" 4`
   
   On the VM, I get pretty consistent results like this:
   
   ```
   {"localtimestamp": "17/Nov/2022:14:09:57 -0500", "client_ip": "10.211.55.2", 
"cache_response": "TCP_MISS", "cache_response_code": 200, "request_page": 
"http://cache-vm/MHawk/rolling-buffer/itcrowd/index.m3u8";, "server_response": 
200, "bytes": 161, "user_agent": "curl/7.79.1", "sessionId": "01_2087214833", 
"origin_response_time": 20, "client_response_time": 68, "ssl_connection": 0, 
"referrer": "-"}
   {"localtimestamp": "17/Nov/2022:14:09:57 -0500", "client_ip": "10.211.55.2", 
"cache_response": "ERR_CLIENT_ABORT", "cache_response_code": 000, 
"request_page": "http://cache-vm/MHawk/rolling-buffer/itcrowd/index.m3u8";, 
"server_response": 000, "bytes": 0, "user_agent": "curl/7.79.1", "sessionId": 
"-", "origin_response_time": -1, "client_response_time": 10270, 
"ssl_connection": 0, "referrer": "-"}
   {"localtimestamp": "17/Nov/2022:14:09:57 -0500", "client_ip": "10.211.55.2", 
"cache_response": "ERR_CLIENT_ABORT", "cache_response_code": 000, 
"request_page": "http://cache-vm/MHawk/rolling-buffer/itcrowd/index.m3u8";, 
"server_response": 000, "bytes": 0, "user_agent": "curl/7.79.1", "sessionId": 
"-", "origin_response_time": -1, "client_response_time": 10277, 
"ssl_connection": 0, "referrer": "-"}
   {"localtimestamp": "17/Nov/2022:14:09:57 -0500", "client_ip": "10.211.55.2", 
"cache_response": "ERR_CLIENT_ABORT", "cache_response_code": 000, 
"request_page": "http://cache-vm/MHawk/rolling-buffer/itcrowd/index.m3u8";, 
"server_response": 000, "bytes": 0, "user_agent": "curl/7.79.1", "sessionId": 
"-", "origin_response_time": -1, "client_response_time": 10772, 
"ssl_connection": 0, "referrer": "-"}
   ```
   
   On the VM on my laptop it's pretty consistent that the *lead horse* is a miss and the *trailing horses* are simply left hanging. It isn't always a 75% failure rate, but in my VM environment it almost always is. On a bare metal Dell server with ample resources, the failure rate is lower; for 4 requests fired this way, on average 1 or 2 will be ERR_CLIENT_ABORT instead of all 3. Running my script with more horses has a similar effect: more ERR_CLIENT_ABORTs on the VM than on the bare metal server, but never a clean sweep of successes on either.
   
   Where does this 10 seconds come from?
   
   ```
   CONFIG proxy.config.http.transaction_active_timeout_in INT 10
   ```
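   A convenient way to flip that live while testing, assuming `traffic_ctl` is on the PATH and the record is reloadable:
   
    ```sh
    traffic_ctl config get proxy.config.http.transaction_active_timeout_in
    traffic_ctl config set proxy.config.http.transaction_active_timeout_in 180
    traffic_ctl config reload
    ```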
   
   I've tried both lowering and cranking this up, and the abort time tracks it exactly. I.e., if I set the active timeout to 180, it takes 3 minutes before the curl dies. 10 seconds is a nice balance for testing, and results in a client death like this:
   
   ```
   Script started on Thu Nov 17 19:09:57 2022
   command: curl -m 9999999 -v 
http://cache-vm/MHawk/rolling-buffer/itcrowd/index.m3u8 -L
   *   Trying 10.211.55.4:80...
   * Connected to cache-vm (10.211.55.4) port 80 (#0)
   > GET /MHawk/rolling-buffer/itcrowd/index.m3u8 HTTP/1.1
   > Host: cache-vm
   > User-Agent: curl/7.79.1
   > Accept: */*
   > 
   * Empty reply from server
   * Closing connection 0
   curl: (52) Empty reply from server
   
   Script done on Thu Nov 17 19:10:08 2022
   ```
   
   
   I've configured RWW as follows:
   
   ```
   CONFIG proxy.config.http.cache.open_write_fail_action INT 5
   CONFIG proxy.config.cache.enable_read_while_writer INT 1
   CONFIG proxy.config.http.background_fill_active_timeout INT 0
   CONFIG proxy.config.http.background_fill_completed_threshold FLOAT 0.000000
   CONFIG proxy.config.cache.read_while_writer.max_retries INT 10
   CONFIG proxy.config.cache.read_while_writer_retry.delay INT 50
   CONFIG proxy.config.http.cache.max_open_read_retries INT 10
   CONFIG proxy.config.http.cache.open_read_retry_time INT 100
   ```
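   Back-of-envelope, if those delays were honored as configured, the retries should exhaust well under a second, nowhere near the 10 second timeout. A rough sketch, assuming a flat delay per retry (the RWW delay may scale up on later retries per the docs):
   
    ```sh
    # max_open_read_retries x open_read_retry_time = worst-case open-read window
    echo "open read window: $((10 * 100)) ms"   # = 1000 ms
    # read_while_writer.max_retries x read_while_writer_retry.delay
    echo "rww retry window: $((10 * 50)) ms"    # = 500 ms
    ```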
   
   As far as I can tell from the logs, the retry times are not being honored, as per #9118.
   
   What I was expecting was the specified number of read retries at roughly the configured retry interval. What I see instead is this:
   
   
   ```
   [Nov 17 14:09:34.438] [ET_NET 0] DEBUG: <Cache.cc:996 (cacheInitialized)> 
(cache_init) CacheProcessor::cacheInitialized[2] - ram_cache_bytes = 201326592 
= 192Mb
   [Nov 17 14:09:57.547] [ET_NET 1] DEBUG: <HttpTransact.cc:5478 
(set_client_request_state)> (http_trans) [2] [set_client_request_state] set req 
cont length to -1
   [Nov 17 14:09:57.547] [ET_NET 1] DEBUG: <HttpTransact.cc:1350 
(ModifyRequest)> (http_trans) [2] START HttpTransact::ModifyRequest
   [Nov 17 14:09:57.547] [ET_NET 1] DEBUG: <HttpTransact.cc:1424 
(ModifyRequest)> (http_trans) [2] END HttpTransact::ModifyRequest
   [Nov 17 14:09:57.547] [ET_NET 1] DEBUG: <HttpTransact.cc:961 
(StartRemapRequest)> (http_trans) [2] START HttpTransact::StartRemapRequest
   [Nov 17 14:09:57.547] [ET_NET 1] DEBUG: <HttpTransact.cc:981 
(StartRemapRequest)> (http_trans) [2] Before Remapping:
   [Nov 17 14:09:57.547] [ET_NET 1] DEBUG: <HttpTransact.cc:993 
(StartRemapRequest)> (http_trans) [2] END HttpTransact::StartRemapRequest
   [Nov 17 14:09:57.547] [ET_NET 1] DEBUG: <HttpTransact.cc:995 
(StartRemapRequest)> (http_trans) [2] Checking if transaction wants to upgrade
   [Nov 17 14:09:57.548] [ET_NET 1] DEBUG: <HttpTransact.cc:1008 
(PerformRemap)> (http_trans) [2] Inside PerformRemap
   [Nov 17 14:09:57.548] [ET_NET 1] DEBUG: <HttpTransact.cc:1015 
(EndRemapRequest)> (http_trans) [2] START HttpTransact::EndRemapRequest
   [Nov 17 14:09:57.548] [ET_NET 1] DEBUG: <HttpTransact.cc:1021 
(EndRemapRequest)> (http_trans) [2] EndRemapRequest host is cache-vm
   [Nov 17 14:09:57.548] [ET_NET 1] DEBUG: <HttpTransact.cc:1159 
(EndRemapRequest)> (http_trans) [2] After Remapping:
   [Nov 17 14:09:57.548] [ET_NET 1] DEBUG: <HttpTransact.cc:1178 
(EndRemapRequest)> (http_trans) [2] END HttpTransact::EndRemapRequest
   [Nov 17 14:09:57.549] [ET_NET 1] DEBUG: <HttpTransact.cc:1463 
(HandleRequest)> (http_trans) [2] START HttpTransact::HandleRequest
   [Nov 17 14:09:57.549] [ET_NET 1] DEBUG: <HttpTransact.cc:5478 
(set_client_request_state)> (http_trans) [2] [set_client_request_state] set req 
cont length to -1
   [Nov 17 14:09:57.549] [ET_NET 1] DEBUG: <HttpTransact.cc:6389 
(is_request_valid)> (http_trans) [2] [is_request_valid] no request header errors
   [Nov 17 14:09:57.568] [ET_NET 1] DEBUG: <HttpTransact.cc:1901 (OSDNSLookup)> 
(http_trans) [2] [HttpTransact::OSDNSLookup]
   [Nov 17 14:09:57.568] [ET_NET 1] DEBUG: <HttpTransact.cc:1979 (OSDNSLookup)> 
(http_trans) [2] [OSDNSLookup] DNS lookup for O.S. successful IP: 10.211.55.4
   [Nov 17 14:09:57.568] [ET_NET 1] DEBUG: <HttpTransact.cc:2105 
(DecideCacheLookup)> (http_trans) [2] [DecideCacheLookup] Will do cache lookup.
   [Nov 17 14:09:57.568] [ET_NET 1] DEBUG: <HttpCacheSM.cc:100 
(state_cache_open_read)> (http_cache) [2] [&HttpCacheSM::state_cache_open_read, 
CACHE_EVENT_OPEN_READ_FAILED/TS_EVENT_CACHE_OPEN_READ_FAILED]
   [Nov 17 14:09:57.568] [ET_NET 1] DEBUG: <HttpTransact.cc:2365 
(HandleCacheOpenRead)> (http_trans) [2] [HttpTransact::HandleCacheOpenRead]
   [Nov 17 14:09:57.568] [ET_NET 1] DEBUG: <HttpTransact.cc:2395 
(HandleCacheOpenRead)> (http_trans) [2] CacheOpenRead -- miss
   [Nov 17 14:09:57.568] [ET_NET 1] DEBUG: <HttpTransact.cc:3272 
(HandleCacheOpenReadMiss)> (http_trans) [2] [HandleCacheOpenReadMiss] --- MISS
   [Nov 17 14:09:57.568] [ET_NET 1] DEBUG: <HttpTransact.cc:5183 
(get_ka_info_from_config)> (http_trans) [2] get_ka_info_from_config, 
server_info->http_version 1.1, check_hostdb 0
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpTransact.cc:1757 (PPDNSLookup)> 
(http_trans) [2] [HttpTransact::PPDNSLookup]
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpTransact.cc:1791 (PPDNSLookup)> 
(http_trans) [2] [PPDNSLookup] DNS lookup for sm_id[2] successful IP: 
10.211.55.5
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpTransact.cc:5283 
(add_client_ip_to_outgoing_request)> (http_trans) [2] client_ip_set = 0
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpTransact.cc:5293 
(add_client_ip_to_outgoing_request)> (http_trans) [2] inserted request header 
'Client-ip: 10.211.55.2'
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpTransact.cc:5307 
(add_client_ip_to_outgoing_request)> (http_trans) [2] 
[add_client_ip_to_outgoing_request] Appended connecting client's (10.211.55.2) 
to the X-Forwards header
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpTransact.cc:7745 
(build_request)> (http_trans) [2] [build_request] removing host name from url
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpTransact.cc:7758 
(build_request)> (http_trans) [2] [build_request] request like cacheable and 
conditional headers removed
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpTransact.cc:7787 
(build_request)> (http_trans) [2] [build_request] request_sent_time: 1668712197
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpSM.cc:4838 
(do_cache_prepare_action)> (http_cache_write) [2] writing to cache with URL 
MHawk/rolling-buffer/itcrowd/index.m3u8
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpCacheSM.cc:165 
(state_cache_open_write)> (http_cache) [2] 
[&HttpCacheSM::state_cache_open_write, 
CACHE_EVENT_OPEN_WRITE_FAILED/TS_EVENT_CACHE_OPEN_WRITE_FAILED]
   [Nov 17 14:09:57.588] [ET_NET 1] DEBUG: <HttpCacheSM.cc:192 
(state_cache_open_write)> (http_cache) [2] [state_cache_open_write] cache open 
write failure 1. read retry triggered
   [Nov 17 14:09:57.727] [ET_NET 1] DEBUG: <HttpCacheSM.cc:165 
(state_cache_open_write)> (http_cache) [2] 
[&HttpCacheSM::state_cache_open_write, EVENT_INTERVAL]
   [Nov 17 14:09:57.728] [ET_NET 1] DEBUG: <HttpCacheSM.cc:227 
(state_cache_open_write)> (http_cache) [2] [state_cache_open_write] cache open 
write failure 2. falling back to read retry...
   [Nov 17 14:09:57.728] [ET_NET 1] DEBUG: <HttpCacheSM.cc:100 
(state_cache_open_read)> (http_cache) [2] [&HttpCacheSM::state_cache_open_read, 
CACHE_EVENT_OPEN_READ/TS_EVENT_CACHE_OPEN_READ]
   ```
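   A side note on timing from that log: the gap between the first open-write failure (14:09:57.588) and the EVENT_INTERVAL wakeup (14:09:57.727) is about 139 ms, which doesn't cleanly match either the 50 ms RWW delay or the 100 ms open read retry time:
   
    ```sh
    # Millisecond delta between the two wakeups in the log above
    echo "$((727 - 588)) ms"   # = 139 ms
    ```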
   
   Looking at the code in iocore/cache, it seems that the second write failure sets the retry counter to max + 1, so that the transaction doesn't keep retrying as a writer. That seems to be confirmed by changing the count, and it reflects what the documentation says about writers in RWW. But it doesn't explain why no response occurs until the active timeout pops.
   
   Turning on inactivity_cop in the debug tags, I even see this (logs taken from another run, for ease of finding the right entries):
   
   ```
   [Nov 17 14:34:32.918] [ET_NET 0] DEBUG: <UnixNet.cc:78 (check_inactivity)> 
(inactivity_cop) vc: 0x2ae63801e2b0 inactivity timeout not set, setting a 
default of 86400
   [Nov 17 14:34:41.912] [ET_NET 0] DEBUG: <UnixNet.cc:99 (check_inactivity)> 
(inactivity_cop_verbose) active ne: 0x2ae63801e2b0 now: 1668713681 timeout at: 
1668713681578151852 timeout in: 10000000000
   ```
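   Note the units: the cop appears to report in nanoseconds there, so that `timeout in: 10000000000` is exactly the configured 10 second active timeout:
   
    ```sh
    # 10,000,000,000 ns -> seconds
    echo $((10000000000 / 1000000000))   # prints 10
    ```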
   
   For curl's part, the max time should keep it waiting for a response for however long ATS takes to produce one. The "Empty reply from server" behavior is similar to what an HTTP server would yield if its request handler wrote no response bytes before closing the connection.
   
   **Now here's the funny part**
   
   If I fiddle with my origin so that response times to ATS are >= 1 second, I get 100% success with this. So it seems this only happens when the requests ride closely together and the origin_response_time is quick. For the origin VM on the same laptop, that's about 60-90 ms; for the origin behind the Dell server, more like 100~200 ms.
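   To reproduce that slow-origin case without touching the origin app itself, one hypothetical approach on Linux is to inject latency with netem on the origin host (assumes its interface is eth0):
   
    ```sh
    # Add ~1 s of delay to everything leaving eth0, then remove it afterwards
    sudo tc qdisc add dev eth0 root netem delay 1000ms
    # ...run the test...
    sudo tc qdisc del dev eth0 root netem
    ```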
   
   Cranking the RWW settings in the VM up to a significantly larger number of retries and longer retry delays seems to have no effect, either combined or in isolation.

