dataroaring commented on PR #59482:
URL: https://github.com/apache/doris/pull/59482#issuecomment-3774158472
# Code Review: [Opt](cloud) Add segment prefetcher
## Summary
This PR introduces a **segment prefetcher** to optimize cold query
performance when reading data from S3 in cloud deployments. The benchmarks are
impressive: **~4.3x improvement** (2min 47sec → 39sec) for cold queries
fetching 114GB per BE from S3.
The prefetcher uses a **read-ahead windowing strategy** - it maintains
awareness of upcoming blocks based on the row bitmap and triggers async
prefetch when the read position approaches block boundaries. This is
particularly effective for columnar storage where predicate filtering creates
non-contiguous row access patterns.
---
## Architecture Assessment
### Design Strengths
1. **Lazy Initialization**: Prefetching is only enabled for cloud mode and
specific reader types (query/compaction), minimizing overhead for local
deployments.
2. **Weak Reference Pattern** (`cached_remote_file_reader.cpp:627-642`): The
prefetch task captures a `weak_ptr` to avoid preventing destruction of the file
reader - good design.
3. **Complex Type Handling**: Properly handles Map/Array/Struct columns
where nested column access patterns differ from outer offset columns.
---
## Critical Issues
### 1. Race Condition in Block Sequence Access (HIGH)
In `segment_prefetcher.cpp`, `need_prefetch()` modifies `_prefetched_idx`
and `_current_block_idx` without synchronization, while these could potentially
be accessed from prefetch callbacks during verbose logging:
```cpp
bool SegmentPrefetcher::need_prefetch(uint32_t rowid,
std::vector<BlockRange>* ranges) {
// ... modifies _prefetched_idx, _current_block_idx without lock
}
```
**Recommendation**: Either add mutex protection around shared state, or
document that `need_prefetch()` must only be called from the scan thread (and
ensure prefetch callbacks never access this state).
### 2. Thread Pool Starvation Risk (MEDIUM)
Configuration allows up to **2000 threads** in the prefetch pool. With high
concurrency and large `prefetch_block_size`, this could exhaust system
resources.
```cpp
DEFINE_Int64(segment_prefetch_thread_pool_thread_num_max, "2000");
```
**Recommendation**: Consider adding a bounded queue with back-pressure, or
adaptive prefetch rate based on cache hit ratio.
### 3. Missing Error Propagation (MEDIUM)
Prefetch failures are silently logged at VLOG_DEBUG level but never surfaced:
```cpp
if (!st.ok()) {
VLOG_DEBUG << "Failed to submit prefetch task...";
}
```
**Recommendation**: Add metrics/counters for prefetch failures to enable
monitoring.
---
## Code Quality Issues
### 4. Inconsistent Prefetch Window Calculation (`segment_iterator.cpp:573`)
```cpp
int window_size = 1 + (is_query ?
config::query_segment_file_cache_prefetch_block_size
:
config::compaction_segment_file_cache_prefetch_block_size);
```
The `1 +` is confusing. If `prefetch_block_size=2`, this creates a window of
3. The naming suggests "how many blocks to prefetch ahead" but the +1 makes it
"current + N ahead".
**Recommendation**: Either rename the config to `prefetch_window_size` or
document the +1 semantics.
### 5. Potential Dangling IOContext (`cached_remote_file_reader.cpp:618-624`)
The copied `dryrun_ctx` nullifies `query_id`, `file_cache_stats`,
`file_reader_stats` but copies other fields that might contain dangling
references after the parent context is destroyed:
```cpp
IOContext dryrun_ctx;
if (io_ctx != nullptr) {
dryrun_ctx = *io_ctx; // shallow copy
}
dryrun_ctx.query_id = nullptr; // only nullifies some fields
```
**Recommendation**: Review `IOContext` for other pointer/reference members
that need clearing.
---
## Testing Gaps
### 6. No Unit Tests
The PR adds ~1150 lines of new code but no regression or unit tests are
included. Given the complexity of block sequence building, prefetch window
management, and complex type handling, unit tests would be valuable for:
- `SegmentPrefetcher::add_rowids()` with various bitmap patterns
- `need_prefetch()` boundary conditions
- Backward iteration logic
### 7. No Compaction Testing Validation
Compaction prefetch is enabled but the benchmark only covers query
workloads. Compaction has different I/O patterns (sequential, full segment
reads).
---
## Minor Issues
1. **Magic Numbers**: `segment_file_cache_consume_rowids_batch_size = 8000`
lacks rationale documentation.
2. **Verbose Logging Pattern**: `LOG_IF(INFO,
config::enable_segment_prefetch_verbose_log)` is repeated ~20 times - consider
a macro.
---
## Configuration Review
| Config | Default | Assessment |
|--------|---------|------------|
| `enable_query_segment_file_cache_prefetch` | `false` | Good - conservative
default |
| `query_segment_file_cache_prefetch_block_size` | `2` | Reasonable |
| `segment_prefetch_thread_pool_thread_num_max` | `2000` | Potentially too
high |
---
## Verdict
**Overall**: Well-structured optimization with significant performance
benefits for cloud deployments. The core design is sound.
**Recommendations before merge**:
1. Address or document the thread safety model
2. Add basic unit tests for the prefetcher
3. Add metrics for prefetch success/failure rates
4. Consider lowering `segment_prefetch_thread_pool_thread_num_max` default
or adding back-pressure
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]