JackieTien97 opened a new pull request, #815:
URL: https://github.com/apache/tsfile/pull/815

   ## Summary
   
   Optimize `TsFileReader::get_timeseries_metadata()` to eliminate redundant index tree traversals and duplicate disk reads when collecting timeseries metadata for all devices.
   
   ### Root Causes
   
   1. **N+1 I/O**: `get_all_device_ids()` traverses the entire device index tree but discards offset information. Then, for each device, `load_device_index_entry()` re-searches the tree from the root, costing O(N×D) redundant disk reads for N devices and tree depth D.
   2. **Duplicate read**: `get_device_timeseries_meta_without_chunk_meta()` reads the measurement index node once to check alignment, then `load_all_measurement_index_entry()` reads the exact same byte range again.
   3. **Redundant `PageArena::init()`**: Called per device in a loop despite the arena already being initialized in the constructor.
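
To make the cost in root cause 1 concrete, here is a rough sketch that counts simulated node reads on a toy in-memory tree. `Node`, `Tree`, `build_tree`, `lookup`, and `collect_all` are hypothetical names invented for this illustration; they are not the tsfile classes, and the real index layout may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy index tree (hypothetical types, not the tsfile classes).
struct Node {
    bool leaf = false;
    std::vector<int> keys;      // leaf: device keys; internal: smallest key under each child
    std::vector<int> children;  // internal only: indices into Tree::nodes
};

struct Tree {
    std::vector<Node> nodes;
    int root = -1;
    mutable long reads = 0;  // number of simulated node reads

    const Node& read(int idx) const {
        ++reads;
        return nodes[static_cast<std::size_t>(idx)];
    }
};

// Builds a balanced tree over device keys 0..n_devices-1, bottom-up.
Tree build_tree(int n_devices, int fanout) {
    assert(n_devices > 0 && fanout > 1);
    const std::size_t f = static_cast<std::size_t>(fanout);
    Tree t;
    std::vector<int> level;  // node indices of the level built so far
    for (int start = 0; start < n_devices; start += fanout) {
        Node leaf;
        leaf.leaf = true;
        for (int k = start; k < std::min(start + fanout, n_devices); ++k) leaf.keys.push_back(k);
        t.nodes.push_back(leaf);
        level.push_back(static_cast<int>(t.nodes.size()) - 1);
    }
    while (level.size() > 1) {
        std::vector<int> next;
        for (std::size_t start = 0; start < level.size(); start += f) {
            Node inner;
            for (std::size_t c = start; c < std::min(start + f, level.size()); ++c) {
                inner.children.push_back(level[c]);
                inner.keys.push_back(t.nodes[static_cast<std::size_t>(level[c])].keys.front());
            }
            t.nodes.push_back(inner);
            next.push_back(static_cast<int>(t.nodes.size()) - 1);
        }
        level = next;
    }
    t.root = level.front();
    return t;
}

// Point lookup from the root: one node read per tree level.
bool lookup(const Tree& t, int key) {
    int idx = t.root;
    while (true) {
        const Node& n = t.read(idx);
        if (n.leaf) return std::binary_search(n.keys.begin(), n.keys.end(), key);
        auto it = std::upper_bound(n.keys.begin(), n.keys.end(), key);
        std::size_t child = (it == n.keys.begin()) ? 0 : static_cast<std::size_t>(it - n.keys.begin()) - 1;
        idx = n.children[child];
    }
}

// Full traversal: every node is read exactly once.
void collect_all(const Tree& t, int idx, std::vector<int>& out) {
    const Node& n = t.read(idx);
    if (n.leaf) {
        out.insert(out.end(), n.keys.begin(), n.keys.end());
        return;
    }
    for (int c : n.children) collect_all(t, c, out);
}
```

In this toy model with 1000 devices and a fan-out of 10, the tree has 111 nodes and depth 3, so `collect_all` performs 111 reads while calling `lookup` once per device performs 3000. The second figure grows with N×D, which is the pattern described above.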
   
   ### Changes
   
   - Add `TsFileIOReader::get_device_timeseries_meta_by_offset()`: accepts pre-resolved offsets (skipping `load_device_index_entry`) and reuses the deserialized top node for `get_all_leaf()` (eliminating the duplicate read)
   - Add `TsFileReader::get_all_device_entries()`: a single tree traversal that collects device IDs together with their `(start_offset, end_offset)`
   - Rewrite `get_timeseries_metadata()` to combine the above
   - Remove the redundant `PageArena::init()` call from `get_timeseries_metadata_impl()`
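
The shape of these changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `DeviceEntry`, `IndexNode`, `ToyFile`, `MeasurementNode`, `parse_node`, and `load_by_offset` are hypothetical names, and the byte format (a leading `A` marking an aligned device, followed by comma-separated measurement names) is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: a device ID plus the byte range of its measurement index node.
struct DeviceEntry {
    std::string device_id;
    std::int64_t start_offset;
    std::int64_t end_offset;  // exclusive
};

// Hypothetical in-memory index node; leaves carry entries, internal nodes carry children.
struct IndexNode {
    std::vector<DeviceEntry> entries;
    std::vector<IndexNode> children;
};

// Single traversal that keeps the offsets instead of discarding them.
void collect_device_entries(const IndexNode& node, std::vector<DeviceEntry>& out) {
    if (node.children.empty()) {
        out.insert(out.end(), node.entries.begin(), node.entries.end());
        return;
    }
    for (const IndexNode& child : node.children) collect_device_entries(child, out);
}

// Stand-in for a file reader that counts how often a byte range is fetched.
struct ToyFile {
    std::string bytes;
    int read_calls = 0;

    std::string read_range(std::int64_t start, std::int64_t end) {
        assert(0 <= start && start <= end &&
               end <= static_cast<std::int64_t>(bytes.size()));
        ++read_calls;
        return bytes.substr(static_cast<std::size_t>(start),
                            static_cast<std::size_t>(end - start));
    }
};

struct MeasurementNode {
    bool aligned = false;
    std::vector<std::string> measurements;
};

// Made-up format: first byte 'A' means aligned; the rest is a comma-separated name list.
MeasurementNode parse_node(const std::string& raw) {
    assert(!raw.empty());
    MeasurementNode node;
    node.aligned = (raw[0] == 'A');
    std::size_t pos = 1;
    while (pos < raw.size()) {
        std::size_t comma = raw.find(',', pos);
        if (comma == std::string::npos) comma = raw.size();
        node.measurements.push_back(raw.substr(pos, comma - pos));
        pos = comma + 1;
    }
    return node;
}

// One read and one parse per device; the result serves both the alignment
// check and the leaf listing, so the byte range is never fetched twice.
MeasurementNode load_by_offset(ToyFile& file, const DeviceEntry& entry) {
    return parse_node(file.read_range(entry.start_offset, entry.end_offset));
}
```

A caller would run `collect_device_entries` once and then call `load_by_offset` for each collected entry, so the number of range reads equals the number of devices and no per-device search from the root is needed.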
   
   ## Benchmark
   
   **Test file**: `ecg_dataset/part_0.tsfile` (634 MB, 53040 devices)
   
   | Path | Avg Time | Speedup |
   |------|----------|---------|
   | **Before** (`get_all_device_ids` + `get_timeseries_metadata(ids)`) | ~28.8 s | - |
   | **After** (`get_timeseries_metadata()` optimized) | ~0.30 s | **~94x** |
   
   Raw results (5 rounds each, after 2 warmup rounds):
   
   ```
   === Old path (get_all_device_ids + get_timeseries_metadata(ids)) ===
     Round 1: 28508721 us  (devices=53040)
     Round 2: 30220159 us  (devices=53040)
     Round 3: 28359948 us  (devices=53040)
     Round 4: 28497494 us  (devices=53040)
     Round 5: 28451505 us  (devices=53040)
   
   === New path (get_timeseries_metadata(), optimized) ===
     Round 1: 303186 us  (devices=53040)
     Round 2: 303545 us  (devices=53040)
     Round 3: 303869 us  (devices=53040)
     Round 4: 304806 us  (devices=53040)
     Round 5: 304579 us  (devices=53040)
   ```
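
For anyone who wants to re-derive the table from the raw rounds, a small sketch of the arithmetic follows. The sample values are copied from the output above; the names `mean_us`, `kOldPathUs`, `kNewPathUs`, and `speedup` are only illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Arithmetic mean of a set of timings in microseconds.
double mean_us(const std::vector<std::int64_t>& samples) {
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// Raw rounds from the benchmark output above, in microseconds.
const std::vector<std::int64_t> kOldPathUs = {28508721, 30220159, 28359948,
                                              28497494, 28451505};
const std::vector<std::int64_t> kNewPathUs = {303186, 303545, 303869,
                                              304806, 304579};

// Ratio of the mean old-path time to the mean new-path time.
double speedup() {
    return mean_us(kOldPathUs) / mean_us(kNewPathUs);
}
```

The means come out to roughly 28.81 s and 0.304 s, a ratio a little under 95, in line with the ~94x reported in the table.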
   
   ## Test plan
   
   - [x] All 522 existing tests pass
   - [x] Verified result correctness: both paths return the same device count (53040)

