dataroaring commented on PR #59482:
URL: https://github.com/apache/doris/pull/59482#issuecomment-3774158472

   # Code Review: [Opt](cloud) Add segment prefetcher
   
   ## Summary
   
   This PR introduces a **segment prefetcher** to optimize cold query 
performance when reading data from S3 in cloud deployments. The benchmarks are 
impressive: **~4.3x improvement** (2min 47sec → 39sec) for cold queries 
fetching 114GB per BE from S3.
   
   The prefetcher uses a **read-ahead windowing strategy** - it derives the sequence of upcoming blocks from the row bitmap and triggers an async prefetch when the read position approaches a block boundary. This is particularly effective for columnar storage, where predicate filtering creates non-contiguous row access patterns.
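   To make the strategy concrete, here is a minimal sketch of the windowing idea (all names are hypothetical - this is not the PR's code):
   
   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <utility>
   #include <vector>
   
   struct BlockRange {
       uint64_t offset;
       uint64_t size;
   };
   
   // Blocks are derived from the row bitmap up front; whenever the scan
   // position enters the tail of the already-prefetched window, the next
   // blocks are returned so the caller can hand them to an async pool.
   class WindowedPrefetcher {
   public:
       WindowedPrefetcher(std::vector<BlockRange> blocks, std::size_t window)
               : _blocks(std::move(blocks)), _window(window) {}
   
       // Blocks that should be prefetched for `block_idx`; empty when the
       // window already covers it.
       std::vector<BlockRange> advance(std::size_t block_idx) {
           std::vector<BlockRange> to_fetch;
           while (_prefetched_idx < _blocks.size() &&
                  _prefetched_idx < block_idx + _window) {
               to_fetch.push_back(_blocks[_prefetched_idx++]);
           }
           return to_fetch;
       }
   
   private:
       std::vector<BlockRange> _blocks;
       std::size_t _window;
       std::size_t _prefetched_idx = 0;
   };
   ```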
   
   ---
   
   ## Architecture Assessment
   
   ### Design Strengths
   
   1. **Lazy Initialization**: Prefetching is only enabled for cloud mode and 
specific reader types (query/compaction), minimizing overhead for local 
deployments.
   
   2. **Weak Reference Pattern** (`cached_remote_file_reader.cpp:627-642`): The prefetch task captures a `weak_ptr` so a queued task cannot keep the file reader alive - good design (illustrated after this list).
   
   3. **Complex Type Handling**: Properly handles Map/Array/Struct columns 
where nested column access patterns differ from outer offset columns.
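   The weak-reference pattern from item 2, illustrated generically (names are hypothetical, not the PR's code):
   
   ```cpp
   #include <cstddef>
   #include <functional>
   #include <memory>
   
   struct FileReader {
       void prefetch_range(std::size_t offset, std::size_t size) { /* issue S3 read */ }
   };
   
   // The queued task holds only a weak_ptr, so the reader can be destroyed
   // while the task is still waiting in the prefetch pool.
   std::function<void()> make_prefetch_task(const std::shared_ptr<FileReader>& reader,
                                            std::size_t offset, std::size_t size) {
       std::weak_ptr<FileReader> weak = reader;  // does not extend the lifetime
       return [weak, offset, size]() {
           if (auto r = weak.lock()) {           // reader still alive?
               r->prefetch_range(offset, size);
           }                                     // else: drop the task silently
       };
   }
   ```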
   
   ---
   
   ## Critical Issues
   
   ### 1. Race Condition in Block Sequence Access (HIGH)
   
   In `segment_prefetcher.cpp`, `need_prefetch()` modifies `_prefetched_idx` 
and `_current_block_idx` without synchronization, while these could potentially 
be accessed from prefetch callbacks during verbose logging:
   
   ```cpp
   bool SegmentPrefetcher::need_prefetch(uint32_t rowid, std::vector<BlockRange>* ranges) {
       // ... modifies _prefetched_idx, _current_block_idx without lock
   }
   ```
   
   **Recommendation**: Either add mutex protection around shared state, or 
document that `need_prefetch()` must only be called from the scan thread (and 
ensure prefetch callbacks never access this state).
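   If the mutex route is taken, a minimal sketch (the `_mutex` member is hypothetical; the two indices are the ones named above):
   
   ```cpp
   #include <algorithm>
   #include <cstdint>
   #include <mutex>
   
   class SegmentPrefetcherSketch {
   public:
       void advance_to(uint32_t block_idx) {
           std::lock_guard<std::mutex> guard(_mutex);  // serialize index updates
           _current_block_idx = block_idx;
           _prefetched_idx = std::max(_prefetched_idx, block_idx);
       }
   
   private:
       std::mutex _mutex;
       uint32_t _prefetched_idx = 0;
       uint32_t _current_block_idx = 0;
   };
   ```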
   
   ### 2. Thread Pool Starvation Risk (MEDIUM)
   
   Configuration allows up to **2000 threads** in the prefetch pool. With high 
concurrency and large `prefetch_block_size`, this could exhaust system 
resources.
   
   ```cpp
   DEFINE_Int64(segment_prefetch_thread_pool_thread_num_max, "2000");
   ```
   
   **Recommendation**: Consider adding a bounded queue with back-pressure, or 
adaptive prefetch rate based on cache hit ratio.
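   One generic shape for back-pressure (a sketch, not Doris code; in-tree, bounding the `ThreadPool` queue would be more natural): cap the number of in-flight prefetch tasks and drop prefetches when the cap is hit, since a skipped prefetch only costs latency, never correctness.
   
   ```cpp
   #include <cstddef>
   #include <semaphore>
   #include <utility>
   
   class BoundedSubmitter {
   public:
       explicit BoundedSubmitter(std::ptrdiff_t max_in_flight) : _slots(max_in_flight) {}
   
       // Returns false (prefetch skipped) when too many tasks are in flight.
       template <typename Submit, typename Task>
       bool try_submit(Submit&& submit, Task&& task) {
           if (!_slots.try_acquire()) return false;  // back-pressure: drop it
           submit([this, t = std::forward<Task>(task)]() mutable {
               t();
               _slots.release();  // free the slot once the task finishes
           });
           return true;
       }
   
   private:
       std::counting_semaphore<> _slots;  // C++20
   };
   ```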
   
   ### 3. Missing Error Propagation (MEDIUM)
   
   Prefetch failures are logged only at VLOG_DEBUG level and are never surfaced:
   
   ```cpp
   if (!st.ok()) {
       VLOG_DEBUG << "Failed to submit prefetch task...";
   }
   ```
   
   **Recommendation**: Add metrics/counters for prefetch failures to enable 
monitoring.
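   A minimal counter sketch (in-tree code would presumably use Doris's existing metrics framework rather than raw atomics):
   
   ```cpp
   #include <atomic>
   #include <cstdint>
   
   // Exported counters make prefetch health observable instead of burying
   // failures in VLOG_DEBUG output.
   struct PrefetchStats {
       std::atomic<uint64_t> submit_successes{0};
       std::atomic<uint64_t> submit_failures{0};
   };
   
   inline void record_submit(PrefetchStats& stats, bool ok) {
       (ok ? stats.submit_successes : stats.submit_failures)
               .fetch_add(1, std::memory_order_relaxed);
   }
   ```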
   
   ---
   
   ## Code Quality Issues
   
   ### 4. Inconsistent Prefetch Window Calculation (`segment_iterator.cpp:573`)
   
   ```cpp
   int window_size = 1 + (is_query ? config::query_segment_file_cache_prefetch_block_size
                                    : config::compaction_segment_file_cache_prefetch_block_size);
   ```
   
   The `1 +` is confusing. If `prefetch_block_size=2`, this creates a window of 
3. The naming suggests "how many blocks to prefetch ahead" but the +1 makes it 
"current + N ahead".
   
   **Recommendation**: Either rename the config to `prefetch_window_size` or 
document the +1 semantics.
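   If the `1 +` is intentional, splitting the expression and commenting it would make the "current block + N read-ahead blocks" semantics explicit, e.g.:
   
   ```cpp
   // Window = the block currently being read plus N read-ahead blocks,
   // where N = *_segment_file_cache_prefetch_block_size.
   const int read_ahead = is_query ? config::query_segment_file_cache_prefetch_block_size
                                   : config::compaction_segment_file_cache_prefetch_block_size;
   const int window_size = /*current block*/ 1 + read_ahead;
   ```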
   
   ### 5. Potential Dangling IOContext (`cached_remote_file_reader.cpp:618-624`)
   
   The copied `dryrun_ctx` clears `query_id`, `file_cache_stats`, and `file_reader_stats`, but the shallow copy keeps every other field, some of which may hold dangling references once the parent context is destroyed:
   
   ```cpp
   IOContext dryrun_ctx;
   if (io_ctx != nullptr) {
       dryrun_ctx = *io_ctx;  // shallow copy
   }
   dryrun_ctx.query_id = nullptr;           // these three are cleared,
   dryrun_ctx.file_cache_stats = nullptr;   // but every other pointer member
   dryrun_ctx.file_reader_stats = nullptr;  // is copied as-is and may dangle
   ```
   
   **Recommendation**: Review `IOContext` for other pointer/reference members 
that need clearing.
   
   ---
   
   ## Testing Gaps
   
   ### 6. No Unit Tests
   
   The PR adds ~1150 lines of new code but includes no unit or regression tests. Given the complexity of block sequence building, prefetch window management, and complex type handling, unit tests would be valuable for (skeleton after this list):
   - `SegmentPrefetcher::add_rowids()` with various bitmap patterns
   - `need_prefetch()` boundary conditions
   - Backward iteration logic
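   A gtest skeleton for the boundary case might look like the following (the body is left as comments because construction and exact return semantics are not visible in this review - only `add_rowids()` and `need_prefetch()` are named):
   
   ```cpp
   #include <gtest/gtest.h>
   
   TEST(SegmentPrefetcherTest, NeedPrefetchAtWindowBoundary) {
       // SegmentPrefetcher prefetcher = ...;  // set up with a small test segment
       // prefetcher.add_rowids(...);          // sparse bitmap: two distant row ranges
       // std::vector<BlockRange> ranges;
       // EXPECT_FALSE(prefetcher.need_prefetch(/*rowid=*/0, &ranges));   // inside window
       // EXPECT_TRUE(prefetcher.need_prefetch(/*rowid=*/9999, &ranges)); // at boundary
       // EXPECT_FALSE(ranges.empty());
   }
   ```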
   
   ### 7. No Validation for Compaction Workloads
   
   Compaction prefetch is enabled, but the benchmark covers only query workloads. Compaction has different I/O patterns (sequential, full-segment reads), so its prefetch behavior should be validated separately before relying on it there.
   
   ---
   
   ## Minor Issues
   
   1. **Magic Numbers**: `segment_file_cache_consume_rowids_batch_size = 8000` has no documented rationale.
   
   2. **Verbose Logging Pattern**: `LOG_IF(INFO, config::enable_segment_prefetch_verbose_log)` is repeated ~20 times - consider a macro (sketch below).
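   A possible macro (name hypothetical):
   
   ```cpp
   // Wraps the repeated LOG_IF pattern behind one name.
   #define SEGMENT_PREFETCH_VLOG() \
       LOG_IF(INFO, config::enable_segment_prefetch_verbose_log)
   
   // Usage at each of the ~20 call sites:
   // SEGMENT_PREFETCH_VLOG() << "prefetch window advanced to block " << idx;
   ```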
   
   ---
   
   ## Configuration Review
   
   | Config | Default | Assessment |
   |--------|---------|------------|
   | `enable_query_segment_file_cache_prefetch` | `false` | Good - conservative default |
   | `query_segment_file_cache_prefetch_block_size` | `2` | Reasonable |
   | `segment_prefetch_thread_pool_thread_num_max` | `2000` | Potentially too high |
   
   ---
   
   ## Verdict
   
   **Overall**: Well-structured optimization with significant performance 
benefits for cloud deployments. The core design is sound.
   
   **Recommendations before merge**:
   1. Address or document the thread safety model
   2. Add basic unit tests for the prefetcher
   3. Add metrics for prefetch success/failure rates
   4. Consider lowering `segment_prefetch_thread_pool_thread_num_max` default 
or adding back-pressure

