rohityadav1993 opened a new pull request, #18467:
URL: https://github.com/apache/pinot/pull/18467

   ## Summary
   
   `PredownloadScheduler.loadSegmentsFromLocal()` previously opened a full 
`SegmentLocalFSDirectory` with `ReadMode.mmap` for each segment to read its 
CRC. For a segment with C columns and ~5 index types, this triggers ~C×5 major 
page faults per segment (one per index chunk boundary validated by 
`SingleFileIndexDirectory.mapBufferEntries`). With P parallel threads, the 
total page fault rate scales to ~C×5×P per batch — on hosts with thousands of 
segments this creates a page fault storm at startup.
   
   The CRC check exists to determine whether a segment has already been 
downloaded to local disk from deep store, so it can be skipped during 
predownload. The CRC value is pre-written to `creation.meta` (8 bytes) during 
segment build and does not require mmapping index files. This PR replaces the 
`SegmentDirectory` open with a direct read of `creation.meta` via 
`DataInputStream`, and replaces `getDiskSizeBytes()` with 
`FileUtils.sizeOfDirectory()` for the post-download size metric. Both Phase 1 
(local CRC check) and Phase 2 (post-download size reporting) now use this 
optimised path through `loadSegmentFromLocal`.
   
   If a segment's CRC matches but the local segment directory is corrupted 
(e.g. truncated or missing index files), the server startup path handles 
recovery: `BaseTableDataManager.addNewOnlineSegment` calls 
`tryLoadExistingSegment`, which performs a full segment load. If loading fails, 
it falls back to `downloadAndLoadSegment` to re-fetch the segment from deep 
store or peers.
   
   This PR also adds a `Preconditions.checkArgument(segDir.isDirectory())` 
guard in `updateSegmentInfoFromLocal()` to enforce the directory contract 
explicitly, with corresponding test updates.
   
   ## Problem
   
   When `PredownloadScheduler` checks if segments already exist locally (Phase 
1), it opens each segment directory via `SegmentLocalFSDirectory(dir, 
ReadMode.mmap)`. This triggers `SingleFileIndexDirectory.mapBufferEntries()` 
which mmaps the entire index file and validates magic markers at every chunk 
boundary — causing a major page fault per index entry (~C×5 per segment, where 
C is the number of columns). With P concurrent threads across thousands of 
segments, the aggregate fault rate becomes ~C×5×P, producing a page fault storm 
that stalls predownload startup.
   
   After downloading segments (Phase 2), the scheduler calls 
`loadSegmentFromLocal()` again solely to populate the local size metric — 
mmapping a freshly-written cold segment and causing another round of major page 
faults per segment.
   
   ## Fix
   
   **Phase 1 — CRC check**: The CRC is only needed to check whether a segment 
was already downloaded to local disk. Replace `SegmentDirectory` open with a 
direct 8-byte read from `creation.meta`. Zero mmap, zero major page faults.
   
   **Phase 2 — post-download size**: Replace `getDiskSizeBytes()` with 
`FileUtils.sizeOfDirectory()`. No mmap needed for a directory size walk.
   
   **Preconditions**: Add `Preconditions.checkArgument(segDir.isDirectory())` 
in `updateSegmentInfoFromLocal()` to fail fast on invalid paths. 
`loadSegmentFromLocal()` now checks `segDir.isDirectory()` before delegating.
   
   ## Cluster Testing Results
   
   Tested on hosts with ~5TB of segment data (thousands of segments), 48 cores, 
and 266GB memory. Comparing stream+untar configurations to isolate the download 
path overhead:
   
   | Setup | Pagefault Fix | Predownload Duration | iowait | Dirty Pages (avg) 
| Major Faults Δ |
   |---|---|---|---|---|---|
   | S1 — 48 parallelism + stream+untar | No | 33 min | 0.05% | 0.1 GB | 
Near-zero baseline |
   | S5 — 48 parallelism + stream+untar + fix | **Yes** | **31 min** | **0.2%** 
| **~0 GB** | **−3% (flat)** |
   
   Without stream+untar (isolating the page fault fix impact):
   
   | Setup | Pagefault Fix | Predownload Duration | iowait | Dirty Pages (avg) 
| Major Faults Δ |
   |---|---|---|---|---|---|
   | S3 — 48 parallelism | No | 79 min | 29% | 132 GB | +∞ (from 0, peak 10K/s) 
|
   | S4 — 48 parallelism + fix | **Yes** | **63 min** | **8.0%** | **113 GB** | 
**−10% (flat)** |
   
   Key findings:
   - **Pagefault fix prevents mmap during predownload** — major faults stay 
flat or decrease during predownload (S4: −10%, S5: −3%)
   - **Without the fix**, major fault spikes up to 10,011/s during predownload 
(S3)
   - **Combined with stream+untar**, the fix delivers the best result: 31 min 
predownload, ~0% iowait, zero dirty-page backlog (S5)
   - **Serving disk reads are completely isolated** from predownload writes in 
all setups (+0-1%)
   
   ## Test plan
   
   - [x] Unit tests for `PredownloadSegmentInfoTest` — updated to verify 
`Preconditions` throws on non-directory paths and regular files
   - [x] Unit tests for `PredownloadTableInfoTest` — existing tests cover CRC 
match/mismatch and missing segment scenarios
   - [x] All 7 tests pass: `./mvnw -pl pinot-server -am 
-Dtest=PredownloadSegmentInfoTest,PredownloadTableInfoTest test`
   - [x] Cluster testing on 5TB hosts validates page fault elimination
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to