zhuqi-lucas commented on PR #9937: URL: https://github.com/apache/arrow-rs/pull/9937#issuecomment-4397057421
Following up: added a **cross-page** version of the comparison bench (`0b52384`) so the speedup shape is visible across the full range of `n`, not only `n <= one page`. The new function (`time_to_first_n_page_reverse_cross_page`) reverses each page segment individually before continuing, which is correct for any `n` and matches the algorithmic shape Phase 2 will use (per-page reverse + emit, no cross-page accumulation).

Numbers on Apple M-series (`--quick`, 100k INT32, 98 pages, no dict, uncompressed):

| n | row_group_sim | page_reverse(_cross) | Speedup |
|---:|---:|---:|---:|
| 10 (single page) | 26.7 µs | 565 ns | ~47x |
| 100 (single page) | 26.7 µs | 565 ns | ~47x |
| 1024 (single page) | 26.7 µs | 472 ns | ~57x |
| **2 000 (~2 pages)** | 26.8 µs | **770 ns** | **~35x** |
| **10 000 (~10 pages)** | 26.6 µs | **2.85 µs** | **~9.3x** |
| **50 000 (~half file)** | 26.7 µs | **14.5 µs** | **~1.8x** |

Speedup follows `total_pages / pages_needed_for_n` and converges to ~1x as `n` approaches the full chunk, which is the expected shape for a primitive that saves work proportional to the unread tail.

The original single-trailing-reverse function (`time_to_first_n_page_reverse`) is left in place for the small-`n` cases where it produces the same result, with a doc comment noting it is only correct when `n <= PAGE_ROW_COUNT_LIMIT`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
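For readers following the thread, the "per-page reverse + emit, no cross-page accumulation" shape described in the comment can be sketched roughly as below. This is an illustrative sketch only, not the code in the PR: pages are modeled as plain `Vec<i32>` values rather than Parquet pages, and the names `first_n_reversed_cross_page` and `modeled_speedup` are hypothetical. The second function simply evaluates the idealized ratio `total_pages / pages_needed_for_n` from the comment, assuming a fixed number of rows per page.

```rust
/// Hypothetical stand-in for decoded pages: each inner Vec holds one page's
/// values in file order. This is not the arrow-rs / parquet reader API.
///
/// Returns the last `n` values of the chunk in reverse order, reading pages
/// from the tail and reversing each page on its own.
fn first_n_reversed_cross_page(pages: &[Vec<i32>], n: usize) -> Vec<i32> {
    let mut out = Vec::new();
    // Walk pages from the end of the chunk toward the front.
    for page in pages.iter().rev() {
        if out.len() >= n {
            break; // earlier pages are never touched
        }
        let remaining = n - out.len();
        // Reverse this page segment and emit at most `remaining` values;
        // nothing is buffered across page boundaries.
        out.extend(page.iter().rev().take(remaining).copied());
    }
    out
}

/// Idealized speedup model from the comment: total_pages / pages_needed_for_n.
/// Assumes `rows_per_page > 0` and that every page holds `rows_per_page` rows.
fn modeled_speedup(total_pages: usize, rows_per_page: usize, n: usize) -> f64 {
    // Ceiling division, capped at the number of pages and floored at 1.
    let pages_needed = ((n + rows_per_page - 1) / rows_per_page)
        .min(total_pages)
        .max(1);
    total_pages as f64 / pages_needed as f64
}
```

With three toy pages `[1, 2, 3]`, `[4, 5, 6]`, `[7, 8]` and `n = 4`, the first function reads only the last two pages. The model is a ratio of page counts, so it ignores any fixed per-call cost and should be read as a rough shape rather than a prediction of the measured numbers.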
