rohityadav1993 opened a new pull request, #18467: URL: https://github.com/apache/pinot/pull/18467
## Summary `PredownloadScheduler.loadSegmentsFromLocal()` previously opened a full `SegmentLocalFSDirectory` with `ReadMode.mmap` for each segment to read its CRC. For a segment with C columns and ~5 index types, this triggers ~C×5 major page faults per segment (one per index chunk boundary validated by `SingleFileIndexDirectory.mapBufferEntries`). With P parallel threads, the total page fault rate scales to ~C×5×P per batch — on hosts with thousands of segments this creates a page fault storm at startup. The CRC check exists to determine whether a segment has already been downloaded to local disk from deep store, so it can be skipped during predownload. The CRC value is pre-written to `creation.meta` (8 bytes) during segment build and does not require mmapping index files. This PR replaces the `SegmentDirectory` open with a direct read of `creation.meta` via `DataInputStream`, and replaces `getDiskSizeBytes()` with `FileUtils.sizeOfDirectory()` for the post-download size metric. Both Phase 1 (local CRC check) and Phase 2 (post-download size reporting) now use this optimised path through `loadSegmentFromLocal`. If a segment's CRC matches but the local segment directory is corrupted (e.g. truncated or missing index files), the server startup path handles recovery: `BaseTableDataManager.addNewOnlineSegment` calls `tryLoadExistingSegment`, which performs a full segment load. If loading fails, it falls back to `downloadAndLoadSegment` to re-fetch the segment from deep store or peers. This PR also adds a `Preconditions.checkArgument(segDir.isDirectory())` guard in `updateSegmentInfoFromLocal()` to enforce the directory contract explicitly, with corresponding test updates. ## Problem When `PredownloadScheduler` checks if segments already exist locally (Phase 1), it opens each segment directory via `SegmentLocalFSDirectory(dir, ReadMode.mmap)`. This triggers `SingleFileIndexDirectory.mapBufferEntries()` which mmaps the entire index file and validates magic markers at every chunk boundary — causing a major page fault per index entry (~C×5 per segment, where C is the number of columns). With P concurrent threads across thousands of segments, the aggregate fault rate becomes ~C×5×P, producing a page fault storm that stalls predownload startup. After downloading segments (Phase 2), the scheduler calls `loadSegmentFromLocal()` again solely to populate the local size metric — mmapping a freshly-written cold segment and causing another round of major page faults per segment. ## Fix **Phase 1 — CRC check**: The CRC is only needed to check whether a segment was already downloaded to local disk. Replace `SegmentDirectory` open with a direct 8-byte read from `creation.meta`. Zero mmap, zero major page faults. **Phase 2 — post-download size**: Replace `getDiskSizeBytes()` with `FileUtils.sizeOfDirectory()`. No mmap needed for a directory size walk. **Preconditions**: Add `Preconditions.checkArgument(segDir.isDirectory())` in `updateSegmentInfoFromLocal()` to fail fast on invalid paths. `loadSegmentFromLocal()` now checks `segDir.isDirectory()` before delegating. ## Cluster Testing Results Tested on hosts with ~5TB of segment data (thousands of segments), 48 cores, and 266GB memory. Comparing stream+untar configurations to isolate the download path overhead: | Setup | Pagefault Fix | Predownload Duration | iowait | Dirty Pages (avg) | Major Faults Δ | |---|---|---|---|---|---| | S1 — 48 parallelism + stream+untar | No | 33 min | 0.05% | 0.1 GB | Near-zero baseline | | S5 — 48 parallelism + stream+untar + fix | **Yes** | **31 min** | **0.2%** | **~0 GB** | **−3% (flat)** | Without stream+untar (isolating the page fault fix impact): | Setup | Pagefault Fix | Predownload Duration | iowait | Dirty Pages (avg) | Major Faults Δ | |---|---|---|---|---|---| | S3 — 48 parallelism | No | 79 min | 29% | 132 GB | +∞ (from 0, peak 10K/s) | | S4 — 48 parallelism + fix | **Yes** | **63 min** | **8.0%** | **113 GB** | **−10% (flat)** | Key findings: - **Pagefault fix prevents mmap during predownload** — major faults stay flat or decrease during predownload (S4: −10%, S5: −3%) - **Without the fix**, major fault spikes up to 10,011/s during predownload (S3) - **Combined with stream+untar**, the fix delivers the best result: 31 min predownload, ~0% iowait, zero dirty-page backlog (S5) - **Serving disk reads are completely isolated** from predownload writes in all setups (+0-1%) ## Test plan - [x] Unit tests for `PredownloadSegmentInfoTest` — updated to verify `Preconditions` throws on non-directory paths and regular files - [x] Unit tests for `PredownloadTableInfoTest` — existing tests cover CRC match/mismatch and missing segment scenarios - [x] All 7 tests pass: `./mvnw -pl pinot-server -am -Dtest=PredownloadSegmentInfoTest,PredownloadTableInfoTest test` - [x] Cluster testing on 5TB hosts validates page fault elimination 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
