bryancall commented on PR #13116: URL: https://github.com/apache/trafficserver/pull/13116#issuecomment-4479671396
Posting a summary of the design discussion plus the prod validation we just ran on e6. ## Why the SERVER_READ divert and not a response transform `TSHttpTxnErrorBodySet()` from a response-stage hook only worked end-to-end when the origin returned an error path; against a valid origin response the SM took the server-transfer branch and the plugin's body was discarded. The fix in this PR is a precise divert in `HttpSM::handle_api_return()` when the plugin set an internal body, the origin response is valid, no `plugin_tunnel` is in play, and the recorded hook id at `TSHttpTxnErrorBodySet()` time was `READ_RESPONSE_HDR` or `SEND_RESPONSE_HDR`. The natural alternative is to do what `txn_box`'s `upstream-rsp-body` does — install a `TS_HTTP_RESPONSE_TRANSFORM_HOOK` and rewrite body bytes inline. We considered that and chose the SERVER_READ divert for these reasons. ### Why the SERVER_READ divert - **Threat-model alignment.** The motivating issue is SSRF leakage of S3 403 XML bodies (PSECBUGS-107834 / PSECBUGS-108568). `setup_internal_transfer()` releases the server session before the response body is consumed, so the sensitive origin body never crosses the cache write boundary or hits disk. A transform-based fix would still stream the leaky bytes through ATS and, with the default `cache_untransformed` behavior, land them in cache. - **Fixes the API, not just one plugin.** Any plugin calling `TSHttpTxnErrorBodySet()` from a response hook gets correct behavior, not only `header_rewrite`. - **Status code and cacheability decisions remain aligned with the origin response.** - **Content-Length and Content-Type are handled by the existing internal-transfer setup.** ### Trade-offs we accept - Touches `HttpSM` core, which is what made review careful. The divert is gated on five conditions (`internal_msg_buffer`, `!api_server_request_body_set`, `server_response.valid()`, `plugin_tunnel == nullptr`, captured-hook-id in the response stage) to keep the blast radius narrow. - `plugin_tunnel == nullptr` is the explicit `TSHttpConnect` intercept guard, replacing the indirect `api_hooks.get()` check. - The captured-hook-id check (`internal_msg_buffer_set_on`) scopes the divert to plugin-driven response replacement. ### Why not a response transform - Body bytes pass through ATS before being rewritten; the leaky origin payload could be persisted to cache by default. Mitigable, but adds configuration that is easy to get wrong. - It is a substantially larger plugin rewrite (transform vconn lifecycle, chunked-encoding handling, Content-Length recomputation, abort cleanup). - It does not address the underlying gap in `TSHttpTxnErrorBodySet()` behavior, so the next plugin to hit this learns the same lesson. ## Production validation Validated on an internal production edge host serving S3-backed routes (Yahoo edge, 10.0.x ATS line with this PR's patches applied; same Apache change set as PR #13116). The host has been running this build for 65+ hours with no incidents prior to the test. **Method:** appended three header-gated `header_rewrite` rules to an S3-backed remap's per-remap `header_rewrite` config (rules only fire when the client sends a magic `X-CDN-Set-Body-Test: 1` header, so real user traffic is unaffected), bumped `remap.config` mtime, ran `traffic_ctl config reload`. Drove ten curl requests with the magic header against a known S3-backed test endpoint (origin returns a 17 KB GIF). Reverted the config and reloaded again. **Result:** | Counter | Before | After | Delta | Expected | |---|---|---|---|---| | `proxy.process.http.origin_body_replaced` | 0 | 10 | +10 | +10 (one per curl) | | `proxy.process.http.origin_body_replaced_in_transform` | 0 | 0 | 0 | 0 (no transform in play) | | `proxy.process.http.origin_body_replaced_pre_transform` | 0 | 0 | 0 | 0 | | `plugin.header_rewrite ... test_set_body_origin.active_match` | 0 | 10 | +10 | +10 | | Curl response size | 17104 B (real GIF) | 41 B (replacement) | replaced | replaced | | Curl HTTP status | 200 | 200 | preserved | preserved | All ten requests returned HTTP 200 with the replacement body (`test-body-from-set-body-origin-experiment`, 41 bytes) instead of the 17 KB origin GIF. After reverting the rule and waiting for reload to propagate, the same curl returned the real GIF and counters held steady, confirming clean teardown with no lingering side effects. The pre-existing `proxy.process.api.error_body_set_response_hook` counter (which counts calls to `TSHttpTxnErrorBodySet()` from any response hook) was already ticking at roughly 36 calls/sec on background production traffic from another plugin (`txn_box`'s `security.txt` rule); that other path uses a response transform and does **not** trigger the new divert. `origin_body_replaced` correctly stayed at zero from background traffic and went up by exactly ten from our test, confirming the divert is scoped to the cases this PR targets. ## Automated test coverage - `header_rewrite_set_body_origin` AuTest passes, including the `null_transform` interaction case (TRANSFORM_READ branch). - Full SDK regression suite passes: `SDK_API_TSHttpConnectIntercept`, `SDK_API_TSHttpConnectServerIntercept`, `SDK_API_HttpAltInfo`, `SDK_API_HttpTxnTransform`, `SDK_API_HttpTxnCache`, `SDK_API_HttpSsn`, `SDK_API_HttpHookAdd`. - Exported symbols of `libtsapi` / `libtscore` / `libtsutil` / `libtscppapi` are unchanged (0 added, 0 removed). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
