ColinLee Tue, 16 Jun 2026 08:14:55 -0700

Hi all,

I have been working on a branch that optimizes the **read and write paths of the
C++ module**, and I would like to share the work, the benchmark results, and ask
for review.



## 1. What the branch does
The goal was to reduce per-value overhead on the hot paths and to take advantage
of SIMD and multiple cores. The changes fall into three groups.

**Read path**
- Decoders gain batch APIs (`read_batch_int32/int64/float/double`, `skip_*`),
  implemented for PLAIN, TS2DIFF and Gorilla. TS2DIFF supports block-level
  peeking so a time filter can skip an entire block without decoding it; Gorilla
  adds a raw-pointer bit reader that bypasses the `ByteStream` overhead.
- `ChunkReader` / `AlignedChunkReader` add `*_DECODE_TV_BATCH` methods that
  decode time + value into a `TsBlock` in a single pass and apply the batch time
  filter before appending.
- Aligned multi-value mode: one time chunk + N value chunks are decoded in one
  pass, sharing the decoded timestamps and the filter mask.
- Optional **chunk-level parallel decode** via a single process-wide worker pool
  (enabled with `ENABLE_THREADS`): for a multi-column aligned read, one task per
  value column decodes that column's whole chunk up front, with a per-worker
  time decoder/compressor pool parallelizing the time-page decode. Single-column
  reads (or no pool) fall back to a serial inline decode.
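
To illustrate the block-level peeking idea from the read-path list, here is a minimal sketch, not the branch's code. `DeltaBlock`, `TimeRange` and `decode_filtered` are hypothetical names, and the block layout (first value plus plain deltas) is a deliberate simplification of TS2DIFF. It assumes timestamps are ascending, so a block's first and last timestamps bound everything inside it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified block: first timestamp plus successive deltas.
// The real TS2DIFF layout (min-delta base, bit-packed residuals) is not modeled.
struct DeltaBlock {
    int64_t first;                // first timestamp in the block
    int64_t last;                 // last timestamp, readable without decoding
    std::vector<int64_t> deltas;  // deltas[i] = t[i+1] - t[i], all >= 0
};

struct TimeRange {
    int64_t lo;  // inclusive lower bound
    int64_t hi;  // inclusive upper bound
};

// Appends every timestamp inside `r` to `out` and returns how many blocks
// were skipped purely from their [first, last] bounds.
inline std::size_t decode_filtered(const std::vector<DeltaBlock>& blocks,
                                   TimeRange r, std::vector<int64_t>& out) {
    std::size_t skipped = 0;
    for (const DeltaBlock& b : blocks) {
        // Peek: no overlap with the filter means no decode at all.
        if (b.last < r.lo || b.first > r.hi) {
            ++skipped;
            continue;
        }
        // Overlapping block: decode by prefix-summing the deltas.
        int64_t t = b.first;
        if (t >= r.lo && t <= r.hi) out.push_back(t);
        for (int64_t d : b.deltas) {
            t += d;
            if (t >= r.lo && t <= r.hi) out.push_back(t);
        }
    }
    return skipped;
}
```

With three blocks covering 0..30, 40..70 and 80..110 and a filter of [50, 70], only the middle block is decoded and the other two are skipped from their bounds alone.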

**Write path**
- `ValuePageWriter` gains `write_batch` / `write_string_batch` that take
  timestamp + value + null-flag arrays directly, removing the per-value append
  loop. `Tablet` exposes bulk set/reset APIs for reuse.
- `TS2DIFFEncoder::flush` packs all deltas with a single bit-pack instead of
  per-value `write_bits`.
- Batched, NEON-accelerated `Int64Statistic` min/max/sum updates.
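
As a rough picture of what a batched statistic update looks like, the following scalar sketch folds a whole array into min/max/sum in one loop and touches the statistic object once per batch. `Int64StatsSketch` and `update_batch` are hypothetical names, the `double` sum is an assumption, and no NEON intrinsics are used here. The loop is simply written in a form a compiler can vectorize.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a statistics object; field types are assumptions.
struct Int64StatsSketch {
    int64_t min = std::numeric_limits<int64_t>::max();
    int64_t max = std::numeric_limits<int64_t>::min();
    double sum = 0.0;
    std::size_t count = 0;
};

// Folds `n` values into `s` using local accumulators, then writes back once.
inline void update_batch(Int64StatsSketch& s, const int64_t* v, std::size_t n) {
    int64_t lo = s.min;
    int64_t hi = s.max;
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        lo = v[i] < lo ? v[i] : lo;  // branch-free select, easy to vectorize
        hi = v[i] > hi ? v[i] : hi;
        sum += static_cast<double>(v[i]);
    }
    s.min = lo;
    s.max = hi;
    s.sum += sum;
    s.count += n;
}
```

Compared with a per-value `update(value)` call, the accumulators stay in registers for the whole batch and the object is written once at the end.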

**Encoding / SIMD**
- TS2DIFF batch decode uses AVX2 (via SIMDe) for i32/i64, with a scalar
  fallback.
- PLAIN byte-swap uses ARM NEON when available, falling back to
  `__builtin_bswap`.
- Release builds can enable `-O3 -march=native -flto` (`ENABLE_SIMD`), and these
  are automatically dropped under ASan / on Windows/MinGW.
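
For readers unfamiliar with the PLAIN byte-swap step, a minimal scalar sketch of the fallback path might look like this. It assumes the on-disk values are big-endian 64-bit integers; `bswap64_portable`, `host_is_little_endian` and `decode_plain_int64_be` are hypothetical names, and the NEON variant is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reverses the byte order of a 64-bit word.
inline uint64_t bswap64_portable(uint64_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(x);  // compiler builtin, typically one instruction
#else
    x = ((x & 0x00FF00FF00FF00FFull) << 8) | ((x >> 8) & 0x00FF00FF00FF00FFull);
    x = ((x & 0x0000FFFF0000FFFFull) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFull);
    return (x << 32) | (x >> 32);
#endif
}

// Runtime endianness probe so the sketch stays correct on big-endian hosts.
inline bool host_is_little_endian() {
    const uint16_t probe = 1;
    unsigned char first_byte;
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

// Decodes `n` big-endian int64 values from `src` into `dst` in one loop.
inline void decode_plain_int64_be(const unsigned char* src, int64_t* dst,
                                  std::size_t n) {
    const bool swap = host_is_little_endian();
    for (std::size_t i = 0; i < n; ++i) {
        uint64_t raw;
        std::memcpy(&raw, src + i * 8, 8);  // unaligned-safe load
        if (swap) raw = bswap64_portable(raw);
        std::memcpy(&dst[i], &raw, 8);
    }
}
```

The batch form matters more than the swap itself: with no per-value stream call in the loop, the compiler or a hand-written SIMD path can process several values per iteration.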

All changes keep the on-disk TsFile format unchanged and remain interoperable
with the Java and Python implementations.

## 2. Performance results

**Machine**: Apple M3, 8 cores, 16 GB, macOS 15.7.4. C++ built Release with
clang 17 `-O3 -march=native -flto`; Java on OpenJDK 17 (Corretto), `-Xmx6g`.

**Workload**: 5,000,000 rows, one device, N INT64 FIELD columns (N ∈ {4, 8, 16})
plus one STRING tag, xorshift-random values (the *same* sequence for all three
implementations), tablet / batch size 65,536. Throughput is in **million
rows/s** (higher is better).
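
A deterministic generator is what makes "the same sequence for all three implementations" possible. The sketch below uses Marsaglia's 64-bit xorshift with the 13/7/17 shift triple. The post does not say which xorshift variant or seed the benchmark uses, so both are assumptions, and `XorShift64` and `make_column` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Marsaglia xorshift64 (shift triple 13, 7, 17). The state must be non-zero,
// so a zero seed is replaced with an arbitrary non-zero constant.
struct XorShift64 {
    uint64_t state;
    explicit XorShift64(uint64_t seed)
        : state(seed ? seed : 0x9E3779B97F4A7C15ull) {}
    uint64_t next() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        return state;
    }
};

// Fills one INT64 column with `rows` pseudo-random values from `seed`.
inline std::vector<int64_t> make_column(uint64_t seed, std::size_t rows) {
    XorShift64 rng(seed);
    std::vector<int64_t> col(rows);
    for (auto& v : col) v = static_cast<int64_t>(rng.next());
    return col;
}
```

Because the generator is pure integer arithmetic, the same few lines port directly to Java or Python and yield identical columns, provided those ports mask to 64 bits and use an unsigned right shift.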

**Comparison style**: this is deliberately *not* a controlled API match — each
implementation uses its own fastest idiomatic path:

- **this branch**: bulk column write + `TsBlock` batch read + SIMD + thread pool
- **develop**: bulk column write + `TsBlock` batch read, single-thread, no SIMD
- **Java**: `Tablet` batch write + `ResultSet` block-cursor read (single-thread)

The **encoding + compression are pinned identical across all three**, reported
for two settings below. `java` and `develop` are single-thread baselines; each
`current_Nt` cell is this branch at N threads, annotated `(×java / ×develop)` —
its speedup over the two single-thread baselines.

### 2.1 TS_2DIFF + LZ4

**Write**
| Cols | java | develop | current_1t | current_2t | current_4t | current_8t |
|-----:|-----:|--------:|-----------:|-----------:|-----------:|-----------:|
| 4  | 1.7 | 17.7 | 44.8 (25.7/2.5) | 60.1 (34.5/3.4) | 73.7 (42.3/4.2) | 72.2 (41.4/4.1) |
| 8  | 2.3 |  9.8 | 25.9 (11.4/2.7) | 36.9 (16.2/3.8) | 49.5 (21.7/5.1) | 54.1 (23.7/5.5) |
| 16 | 1.3 |  5.4 | 14.2 (10.6/2.6) | 21.1 (15.8/3.9) | 30.9 (23.2/5.8) | 34.4 (25.8/6.4) |

**Read**
| Cols | java | develop | current_1t | current_2t | current_4t | current_8t |
|-----:|-----:|--------:|-----------:|-----------:|-----------:|-----------:|
| 4  | 7.5 | 9.3 | 67.4 (9.0/7.2) | 99.8 (13.4/10.7) | 127.4 (17.1/13.7) | 144.2 (19.3/15.5) |
| 8  | 4.7 | 4.8 | 35.8 (7.7/7.5) | 44.9 (9.6/9.4) | 78.9 (16.9/16.4) | 91.8 (19.7/19.1) |
| 16 | 2.1 | 2.4 | 19.6 (9.5/8.0) | 26.6 (13.0/10.9) | 41.0 (20.0/16.7) | 46.0 (22.5/18.8) |

### 2.2 PLAIN + uncompressed (codec-isolated cross-check)

**Write**
| Cols | java | develop | current_1t | current_2t | current_4t | current_8t |
|-----:|-----:|--------:|-----------:|-----------:|-----------:|-----------:|
| 4  | 3.6 | 18.2 | 43.9 (12.2/2.4) | 60.8 (16.9/3.3) | 72.4 (20.1/4.0) | 75.0 (20.8/4.1) |
| 8  | 2.4 | 10.1 | 26.1 (10.8/2.6) | 38.1 (15.8/3.8) | 50.4 (20.9/5.0) | 52.7 (21.9/5.2) |
| 16 | 1.3 |  4.9 | 14.6 (11.6/3.0) | 22.3 (17.8/4.6) | 31.5 (25.1/6.5) | 36.6 (29.2/7.5) |

**Read**
| Cols | java | develop | current_1t | current_2t | current_4t | current_8t |
|-----:|-----:|--------:|-----------:|-----------:|-----------:|-----------:|
| 4  | 8.6 | 9.2 | 69.6 (8.1/7.6) | 88.0 (10.2/9.6) | 142.1 (16.5/15.5) | 160.2 (18.6/17.4) |
| 8  | 4.8 | 4.1 | 37.7 (7.9/9.2) | 52.3 (11.0/12.8) | 80.7 (17.0/19.7) | 81.4 (17.1/19.9) |
| 16 | 2.8 | 2.4 | 19.9 (7.1/8.1) | 27.8 (10.0/11.4) | 43.5 (15.6/17.8) | 49.5 (17.7/20.2) |

### 2.3 Takeaways

- **Single-thread** (current_1t vs develop): writes are ≈ **2.4–3.0x** faster
  and reads ≈ **7–9x** faster. The read gap is the core of this work: batch
  decode + SIMD + single-pass scatter into `TsBlock`.
- **Multi-thread**: at 8 threads on 8 cores the branch reaches roughly
  **2.3–2.5x** its own single-thread read throughput on wide chunks (16 cols)
  and ≈ **2.2–2.6x** on 8 cols; writes scale ≈ **1.6–2.5x**, flattening past
  4 threads (IO / serial page tail).
- **Encoding barely moves the C++ numbers** (PLAIN vs TS_2DIFF on random data,
  which neither encodes compactly nor compresses well): the bottleneck is
  decode + scatter, not the codec, which is exactly what the batch + SIMD path
  targets.
- Java write is slower largely because the public `Tablet.addValue` path builds
  rows cell-by-cell (vs the C++ bulk column memcpy); that is Java's idiomatic
  best with the v4 table API.
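
The thread-scaling figures can be re-derived from the tables with two small helpers. This is only an arithmetic sketch with hypothetical function names. The usage note below takes the 16-column TS_2DIFF read row (19.6 at one thread, 46.0 at eight) as input.

```cpp
// Scaling factor: throughput at N threads relative to one thread.
inline double scaling(double throughput_n, double throughput_1) {
    return throughput_n / throughput_1;
}

// Parallel efficiency: the scaling factor divided by the thread count,
// where 1.0 would mean perfectly linear scaling.
inline double efficiency(double throughput_n, double throughput_1, int threads) {
    return scaling(throughput_n, throughput_1) / threads;
}
```

`scaling(46.0, 19.6)` evaluates to about 2.35 and `efficiency(46.0, 19.6, 8)` to about 0.29, so most of the end-to-end gain over develop comes from the single-thread path rather than from adding threads.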

## 3. PR
The PR is https://github.com/apache/tsfile/pull/823.

Feedback and review are very welcome, especially on:
- the batch encode/decode implementation (PLAIN / TS2DIFF / Gorilla) and the
  batch null/filter handling,
- the parallelization implementation (single global worker pool, chunk-level
  parallel decode, column-parallel encode),
- the SIMD / `-march=native` build switches and their fallbacks,
- the benchmark methodology.


Best regards,
Colin Li

[DISCUSS] C++ module read/write performance optimization (batching / SIMD / parallelism)
