Hastyshell opened a new pull request, #61386:
URL: https://github.com/apache/doris/pull/61386
## Proposed changes
The three-tier retry logic in `Segment::_open()` (static method) was
structured as `if-else-if`, so when `open_file()` succeeded but
`_parse_footer()` returned `CORRUPTION`, the retry branch was unreachable.
### Root cause
```cpp
// Before: if-else-if structure
auto st = fs->open_file(path, &file_reader, &reader_options);
if (st) { // open_file succeeded (almost
always)
segment->_file_reader = ...;
st = segment->_open(stats); // _parse_footer() → CORRUPTION
stored in st
} else if (st.is<CORRUPTION>() && ...) { // UNREACHABLE: already entered
`if` branch
// Tier 1: clear cache, retry
// Tier 2: bypass cache, read remote directly
}
RETURN_IF_ERROR(st); // CORRUPTION returned without any retry
```
`open_file()` only opens a file handle and rarely returns `CORRUPTION`. The
actual footer checksum validation happens inside `_parse_footer()` (called via
`segment->_open()`). Because the retry was in an `else if` guarded by the same
`st` from `open_file()`, it was never reachable for the common
`_parse_footer()` corruption case.
### Fix
Change `else if` to a separate `if` block, so CORRUPTION from either
`open_file()` or `_parse_footer()` triggers the three-tier retry:
```cpp
// After: independent if blocks
if (st) {
segment->_file_reader = ...;
st = segment->_open(stats); // _parse_footer() → CORRUPTION stored in
st
}
// NOW reachable regardless of where CORRUPTION came from
if (st.is<CORRUPTION>() && reader_options.cache_type == FILE_BLOCK_CACHE) {
// Tier 1: clear file cache, re-download from remote
// Tier 2: bypass cache entirely, read remote directly
// Tier 3: remote source itself corrupt (logs warning)
}
```
### Issue
Observed in cloud (S3) deployments: schema change fails with `CORRUPTION:
Bad segment file footer checksum not match`. Log analysis confirmed that no
retry log messages were ever emitted, consistent with this code bug.
## Tests
- Added `TestFooterCorruptionTriggersRetry` to `segment_corruption_test.cpp`
- Uses the existing `Segment::parse_footer:magic_number_corruption` sync
point to corrupt the footer magic number on the first `_parse_footer()` call
only (simulating file cache corruption)
- Verifies the segment opens successfully via the retry path
## Checklist
- [x] Does it affect the original behavior? Yes — previously CORRUPTION from
`_parse_footer()` was never retried; now it correctly follows the three-tier
retry
- [x] Has unit test been added? Yes
- [x] Is this a backport? This fix should also be applied to `branch-4.0`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]