haohuaijin commented on issue #10140: URL: https://github.com/apache/arrow-rs/issues/10140#issuecomment-4726551760
the repo is in https://github.com/haohuaijin/rle-bitmap-compare 40 files × 5 M rows = 200 M total, ZSTD parquet, narrow-projection top-K SQL: ```sql SELECT service_name, COUNT(*) FROM t WHERE body_<pat> ILIKE '%SVC-C3500%' GROUP BY service_name ORDER BY cnt LIMIT 10 ``` One strategy per process under `/usr/bin/time -l`; `peak heap` is `peak memory footprint`. Three strategies: - `rle` — `ParquetAccessPlan` from `Vec<RowSelector>` (today) - `bitmap` — `ParquetAccessPlan` from a `BooleanBuffer` via the API this issue proposes - `no-index` — keep `body_<pat> ILIKE ...`, `pushdown_filters=false` ### `--pattern fragmented` — hit every 3rd row → `read 1 / skip 2` | strategy | selectors | exec mean | peak heap | |------------|------------:|----------:|----------:| | `rle` | 133,333,854 | 1441 ms | 7.95 GB | | `bitmap` | — | 481 ms | 201 MB | | `no-index` | — | 4782 ms | 269 MB | ### `--pattern clustered` — jittered 8 K-row runs, the case Parquet's happy with | strategy | selectors | exec mean | peak heap | |------------|----------:|----------:|----------:| | `rle` | 17,834 | 145 ms | 162 MB | | `bitmap` | — | 246 ms | 193 MB | | `no-index` | — | 4530 ms | 260 MB | in the fragmented case bitmap is **3× faster** (481 vs 1441 ms) and uses **40× less memory** (201 MB vs 7.95 GB). It's RLE that pays the per-row cost — 133 M selectors processed one by one; bitmap just hands the decoder one buffer per row group. For clustered runs the per-run overhead amortizes over 8 K rows, so RLE wins narrowly (1.7× faster, ~17% less memory). Bitmap stays within ~2× — fine as a default. `no-index` is the floor without an access plan — body still has to be decoded for the ILIKE, so it's 3–30× slower than the indexed paths. Without bitmap, fragmented data forces a choice between 7.95 GB of RLE or 5 s of no-index. 500 M rows (`--files 100`) shows the RLE cliff: fragmented RLE goes superlinear (17.10 GB / 5.85 s, 4× exec for 2.5× data); bitmap stays linear (272 MB / 1.29 s). you can check 500M result in https://github.com/haohuaijin/rle-bitmap-compare cc @2010YOUY01 @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
