Yukang-Lian opened a new pull request, #61636:
URL: https://github.com/apache/doris/pull/61636

   ## Summary
   
   - Add bounded retry (max 3) for S3 `Failed to flush response stream` transient errors that the AWS SDK misclassifies as non-retryable `INTERNAL_FAILURE`
   - Fix misleading `request_id=failed to read` error message
   - Add observability metrics (bvar/profile) for transient error retries
   
   ## Problem
   
   When the AWS SDK's `CurlHttpClient` encounters a stream flush failure after `curl_easy_perform` returns OK, it sets the error type to `INTERNAL_FAILURE` instead of `NETWORK_CONNECTION`. As a result, both retry layers are skipped:
   1. The SDK's internal retry (`S3CustomRetryStrategy`) is skipped because `INTERNAL_FAILURE` is marked non-retryable.
   2. Doris's application-level retry (`s3_file_reader.cpp`) is skipped because it only retries HTTP 429.
   
   A transient, recoverable error is therefore exposed directly to users as a query failure.
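
To make the misclassification concrete, here is a minimal, self-contained sketch of how a response could be marked retriable even though the SDK reports `INTERNAL_FAILURE`. It is not the PR's code: the `ErrorType` enum, the `Response` struct, `sdk_says_retryable`, and `classify()` are hypothetical stand-ins, and matching on the message text is an assumption. The four-condition heuristic the PR mentions is not spelled out in this description and is not reproduced here.

```cpp
#include <string>

// Hypothetical, simplified stand-in for the SDK's error categories.
enum class ErrorType { NETWORK_CONNECTION, INTERNAL_FAILURE, ACCESS_DENIED };

// Hypothetical response shape. The PR adds `error_type` and `is_retriable`
// to ObjectStorageResponse, but the real layout is not shown in this description.
struct Response {
    bool ok = true;
    ErrorType error_type = ErrorType::INTERNAL_FAILURE;
    std::string message;
    bool sdk_says_retryable = false;  // what the SDK's own classification reported
    bool is_retriable = false;        // what the caller should act on
};

// Assumption: the flush failure can be recognized by its message text.
inline bool looks_like_flush_failure(const std::string& msg) {
    return msg.find("Failed to flush response stream") != std::string::npos;
}

// Mark a failed response retriable if the SDK already says so, or if it is
// an INTERNAL_FAILURE whose message matches the transient flush failure.
inline void classify(Response& r) {
    const bool transient_flush =
        r.error_type == ErrorType::INTERNAL_FAILURE && looks_like_flush_failure(r.message);
    r.is_retriable = !r.ok && (r.sdk_says_retryable || transient_flush);
}
```

With a classification like this, a flush failure becomes retriable while other `INTERNAL_FAILURE` errors and permission errors keep failing fast.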
   
   Related issues: CORE-5789, CIR-19680, CIR-19630
   
   ## Changes
   
   | File | Change |
   |------|--------|
   | `obj_storage_client.h` | Add `error_type` and `is_retriable` fields to `ObjectStorageResponse` |
   | `s3_obj_storage_client.cpp` | Compute `is_retriable` via four-condition heuristic in `get_object()` |
   | `s3_file_reader.cpp` | Add transient error retry branch (max 3), debug point, bvar, profile counters; remove `.append("failed to read")` |
   | `s3_file_reader.h` | Add transient error stats to `S3Statistics` |
   | `err_utils.cpp` | Print `request_id=<empty>` instead of empty string in all `s3fs_error()` branches |
   | `s3_obj_stroage_client_mock_test.cpp` | 5 mock unit tests for `is_retriable` classification |
   | `s3_file_reader_retry_test.cpp` | 5 unit tests for reader retry behavior using fake `ObjStorageClient` |
   | `test_s3_read_transient_error_retry.groovy` | Regression test with debug point injection |
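
The reader-side retry branch can be pictured with the following sketch of a bounded retry loop driven by an `is_retriable` flag. It is not the code in `s3_file_reader.cpp`: `ReadResult`, `RetryStats`, `kMaxTransientRetries`, and `read_with_bounded_retry()` are hypothetical names. It assumes "max 3" means up to three retries after the first attempt, and it leaves out any backoff, bvar, or profile-counter wiring.

```cpp
#include <functional>
#include <string>

// Hypothetical result of a single read attempt.
struct ReadResult {
    bool ok;
    bool is_retriable;
    std::string message;
};

// Hypothetical counter, standing in for the transient error stats in the PR.
struct RetryStats {
    int transient_retry_count = 0;
};

// Assumption: the bound counts retries, not total attempts.
constexpr int kMaxTransientRetries = 3;

// Call `read_once` and repeat it while the failure is flagged retriable,
// giving up after kMaxTransientRetries retries. Non-retriable failures and
// successes are returned immediately.
inline ReadResult read_with_bounded_retry(const std::function<ReadResult()>& read_once,
                                          RetryStats& stats) {
    ReadResult res = read_once();
    int retries = 0;
    while (!res.ok && res.is_retriable && retries < kMaxTransientRetries) {
        ++retries;
        ++stats.transient_retry_count;
        res = read_once();
    }
    return res;
}
```

Passing the read as a callable keeps the loop testable with a fake client, in the same spirit as the fake `ObjStorageClient` used by the unit tests listed above.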
   
   ## Test plan
   - [x] BE unit tests: `S3ObjStorageClientMockTest.*` (`is_retriable` classification)
   - [x] BE unit tests: `S3FileReaderRetryTest.*` (reader retry behavior)
   - [ ] Regression test: `test_s3_read_transient_error_retry` (requires S3 + debug points)
   - [x] Manual reproduction: MinIO + proxy injection verified the flush error and a successful read after retry
   
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

