bryancall opened a new issue, #12971:
URL: https://github.com/apache/trafficserver/issues/12971

   ## Description
   
   ATS crashes with a failed assertion when a plugin uses 
`TSHttpTxnServerAddrSet()` to set a server address, the connection fails, and 
ATS retries with caching enabled.
   
   ```
   Fatal: src/proxy/http/HttpTransact.cc:3503: failed assertion 
`s->redirect_info.redirect_in_process`
   ```
   
   The crash occurs in `set_cache_prepare_write_action_for_new_request()` which 
is called from `HandleCacheOpenReadMiss()`.
   
   ## Root Cause
   
   The `TSHttpTxnServerAddrSet` retry logic added in #12733 (for issue #12611) 
does not account for the cache write lock state. The flow is:
   
   1. Cache MISS → `OSDNSLookup` routes to `HandleCacheOpenReadMiss`
   2. First pass: `set_cache_prepare_write_action_for_new_request` sets `action 
= PREPARE_TO_WRITE`
   3. SM acquires the cache write lock → `write_lock_state = SUCCESS`, `action 
= WRITE`
   4. Origin connection **fails**
   5. Retry logic fires (line 3936): detects `os_addr_style == USE_API`, resets 
DNS state to `TRY_DEFAULT`, calls `CallOSDNSLookup()`
   6. DNS resolves → `OSDNSLookup` → `cache_lookup_result == MISS` → routes 
back to `HandleCacheOpenReadMiss`
   7. Second pass: `write_lock_state` is already `SUCCESS` → 
`set_cache_prepare_write_action_for_new_request` hits the assertion at line 
3503 because `redirect_in_process == false`
   
   The assertion only expects `write_lock_state == SUCCESS` during redirects, 
but the connection retry is a second legitimate scenario where the write lock 
is already held.
   
   ## Why Tests Didn't Catch This
   
   The autest added in #12733 uses `enable_cache=False`. With caching disabled, 
`OSDNSLookup` routes to `LookupSkipOpenServer` instead of 
`HandleCacheOpenReadMiss`, so the cache write lock path is never exercised.
   
   ## Fix
   
   The assertion in `set_cache_prepare_write_action_for_new_request` needs to 
be relaxed to also accept the retry case. When `api_server_addr_set_retried` is 
true, the cache key is the same (same URL, different server address), so 
reusing the write lock is safe:
   
   ```cpp
   ink_release_assert(s->redirect_info.redirect_in_process || 
s->api_server_addr_set_retried);
   ```
   
   The existing autest should also add a test run with caching enabled.
   
   ## Stack Trace
   
   ```
   
HttpTransact::set_cache_prepare_write_action_for_new_request(HttpTransact::State*)
   HttpTransact::HandleCacheOpenReadMiss(HttpTransact::State*)
   HttpSM::call_transact_and_set_next_state(void (*)(HttpTransact::State*))
   HttpSM::state_api_callout(int, void*)
   HttpSM::state_api_callback(int, void*)
   TSHttpTxnReenable(tsapi_httptxn*, TSEvent)
   <plugin>.so
   INKContInternal::handle_event(int, void*)
   APIHook::invoke(int, void*) const
   HttpSM::state_api_callout(int, void*)
   HttpSM::state_hostdb_lookup(int, void*)
   ```
   
   ## Key State at Crash
   
   ```
   s->redirect_info.redirect_in_process = false
   s->cache_info.write_lock_state = SUCCESS
   s->cache_info.action = WRITE
   s->api_server_addr_set_retried = true
   s->response_error = CONNECTION_OPEN_FAILED
   s->transact_return_point = HandleCacheOpenReadMiss
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to