Hi all,
I have been working on a branch that optimizes the **read and write paths of the
C++ module**, and I would like to share the work, the benchmark results, and ask
for review.
## 1. What the branch does
The goal was to reduce per-value overhead on the hot paths and to take advantage
of SIMD and multiple cores. The changes fall into four groups.
**Read path**
- Decoders gain batch APIs (`read_batch_int32/int64/float/double`, `skip_*`),
implemented for PLAIN, TS2DIFF and Gorilla. TS2DIFF supports block-level
peeking so a time filter can skip an entire block without decoding it; Gorilla
adds a raw-pointer bit reader that bypasses the `ByteStream` overhead.
- `ChunkReader` / `AlignedChunkReader` add `*_DECODE_TV_BATCH` methods that
decode time + value into a `TsBlock` in a single pass and apply the batch time
filter before appending.
- Aligned multi-value mode: one time chunk + N value chunks are decoded in one
pass, sharing the decoded timestamps and the filter mask.
- Optional **chunk-level parallel decode** via a single process-wide worker pool
(enabled with `ENABLE_THREADS`): for a multi-column aligned read, one task per
value column decodes that column's whole chunk up front, with a per-worker time
decoder/compressor pool parallelizing the time-page decode. Single-column reads
(or no pool) fall back to a serial inline decode.
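
The block-level skip and batch time filter described above can be sketched roughly as follows. This is a minimal illustration, not the branch's code: `DeltaBlock` and `read_filtered` are hypothetical names, and it assumes each block exposes its first and last timestamp through a cheap peek, with timestamps sorted inside a block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified block (not the TsFile layout): the first timestamp
// plus the deltas needed to rebuild the following ones.
struct DeltaBlock {
    int64_t first;                // first timestamp in the block
    int64_t last;                 // last timestamp, readable without decoding
    std::vector<int64_t> deltas;  // deltas[i] = t[i + 1] - t[i]
};

// Appends every timestamp in [lo, hi] to `out` and returns how many blocks
// actually had to be decoded.
std::size_t read_filtered(const std::vector<DeltaBlock>& blocks,
                          int64_t lo, int64_t hi,
                          std::vector<int64_t>& out) {
    std::size_t decoded_blocks = 0;
    for (const DeltaBlock& b : blocks) {
        assert(b.first <= b.last);
        // Peek: [first, last] bounds the block, so a block that cannot
        // overlap the filter range is skipped without decoding.
        if (b.last < lo || b.first > hi) continue;
        ++decoded_blocks;
        int64_t t = b.first;
        if (t >= lo && t <= hi) out.push_back(t);
        for (int64_t d : b.deltas) {
            t += d;  // rebuild the next timestamp from its delta
            if (t >= lo && t <= hi) out.push_back(t);
        }
    }
    return decoded_blocks;
}
```

With three blocks covering 0..30, 40..70 and 80..110 in steps of 10 and a filter of [45, 85], only the last two blocks are decoded and the output is 50, 60, 70, 80.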
**Write path**
- `ValuePageWriter` gains `write_batch` / `write_string_batch` that take
timestamp + value + null-flag arrays directly, removing the per-value append
loop. `Tablet` exposes bulk set/reset APIs for reuse.
- `TS2DIFFEncoder::flush` packs all deltas with a single bit-pack instead of
per-value `write_bits`.
- Batched, NEON-accelerated `Int64Statistic` min/max/sum updates.
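
To illustrate the single-pass packing idea behind the `TS2DIFFEncoder::flush` change, here is a minimal sketch under stated assumptions. `PackedBlock`, `pack_deltas` and `unpack_deltas` are hypothetical names, the layout is not the TsFile on-disk format, residuals are limited to 32 bits, and the differences between neighbouring values are assumed not to overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container; this is not the TsFile on-disk layout.
struct PackedBlock {
    int64_t first = 0;           // first raw value
    int64_t min_delta = 0;       // subtracted from every delta before packing
    unsigned width = 0;          // bits per packed residual
    std::size_t count = 0;       // number of original values
    std::vector<uint8_t> bytes;  // MSB-first bit-packed residuals
};

PackedBlock pack_deltas(const std::vector<int64_t>& values) {
    PackedBlock blk;
    blk.count = values.size();
    if (values.empty()) return blk;
    blk.first = values[0];
    if (values.size() == 1) return blk;

    // Pass 1: deltas and their minimum.
    std::vector<int64_t> deltas(values.size() - 1);
    for (std::size_t i = 0; i < deltas.size(); ++i)
        deltas[i] = values[i + 1] - values[i];
    blk.min_delta = *std::min_element(deltas.begin(), deltas.end());

    // Bit width of the largest residual (delta - min_delta).
    const uint64_t base = static_cast<uint64_t>(blk.min_delta);
    uint64_t max_res = 0;
    for (int64_t d : deltas)
        max_res = std::max(max_res, static_cast<uint64_t>(d) - base);
    for (uint64_t v = max_res; v != 0; v >>= 1) ++blk.width;
    assert(blk.width <= 32);  // sketch limit: keeps the accumulator safe

    // Pass 2: one tight loop packs all residuals through a 64-bit accumulator.
    uint64_t acc = 0;
    unsigned nbits = 0;
    for (int64_t d : deltas) {
        acc = (acc << blk.width) | (static_cast<uint64_t>(d) - base);
        nbits += blk.width;
        while (nbits >= 8) {  // emit every complete byte
            blk.bytes.push_back(static_cast<uint8_t>(acc >> (nbits - 8)));
            nbits -= 8;
        }
    }
    if (nbits > 0)  // left-align the remaining bits in a final byte
        blk.bytes.push_back(static_cast<uint8_t>(acc << (8 - nbits)));
    return blk;
}

// Reference decoder, bit by bit for clarity, used to check the round trip.
std::vector<int64_t> unpack_deltas(const PackedBlock& blk) {
    std::vector<int64_t> out;
    if (blk.count == 0) return out;
    out.push_back(blk.first);
    std::size_t bitpos = 0;
    for (std::size_t i = 1; i < blk.count; ++i) {
        uint64_t r = 0;
        for (unsigned b = 0; b < blk.width; ++b, ++bitpos)
            r = (r << 1) | ((blk.bytes[bitpos / 8] >> (7 - bitpos % 8)) & 1u);
        out.push_back(out.back() +
                      static_cast<int64_t>(r + static_cast<uint64_t>(blk.min_delta)));
    }
    return out;
}
```

For the values 100, 103, 105, 110, 111 the deltas are 3, 2, 5, 1, the minimum delta is 1, and the residuals 2, 1, 4, 0 fit in 3 bits each, so the block packs into two bytes.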
**Encoding / SIMD**
- TS2DIFF batch decode uses AVX2 (via SIMDe) for i32/i64, with a scalar
fallback.
- PLAIN byte-swap uses ARM NEON when available, falling back to
`__builtin_bswap`.
- Release builds can enable `-O3 -march=native -flto` (`ENABLE_SIMD`); these
flags are dropped automatically under ASan and on Windows/MinGW.
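
The NEON-with-fallback pattern for the PLAIN byte-swap might look roughly like the sketch below. The function names are hypothetical and the structure is an assumption about how such a fallback chain is usually written, not a copy of the branch: NEON handles two 64-bit values per iteration when available, and the tail (or the whole batch on other targets) goes through `__builtin_bswap64` or a portable shift-based swap.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Portable scalar byte swap, used when no compiler builtin is available.
inline uint64_t bswap64_portable(uint64_t v) {
    v = ((v & 0x00FF00FF00FF00FFULL) << 8) | ((v >> 8) & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

inline uint64_t bswap64(uint64_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(v);
#else
    return bswap64_portable(v);
#endif
}

// Reverses the byte order of `n` 64-bit values read from `src` into `dst`.
void bswap64_batch(const uint8_t* src, uint8_t* dst, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Two 64-bit lanes per iteration: vrev64q_u8 reverses the bytes inside
    // each 64-bit half of the 128-bit register.
    for (; i + 2 <= n; i += 2) {
        uint8x16_t v = vld1q_u8(src + i * 8);
        vst1q_u8(dst + i * 8, vrev64q_u8(v));
    }
#endif
    // Scalar tail, and the whole batch on non-NEON targets.
    for (; i < n; ++i) {
        uint64_t v;
        std::memcpy(&v, src + i * 8, sizeof v);
        v = bswap64(v);
        std::memcpy(dst + i * 8, &v, sizeof v);
    }
}
```

Going through `memcpy` in the scalar path avoids unaligned 64-bit loads on a raw byte buffer; compilers typically lower it to a single load and store.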
All changes keep the on-disk TsFile format unchanged and remain interoperable
with the Java and Python implementations.
## 2. Performance results
**Machine**: Apple M3, 8 cores, 16 GB, macOS 15.7.4. C++ built Release with
clang 17 `-O3 -march=native -flto`; Java on OpenJDK 17 (Corretto), `-Xmx6g`.
**Workload**: 5,000,000 rows, one device, N INT64 FIELD columns (N ∈ {4, 8, 16})
plus one STRING tag, xorshift-random values (the *same* sequence for all three
implementations), tablet / batch size 65,536. Throughput is in **million rows/s**
(higher is better).
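
For readers who want to reproduce a comparable input, a deterministic xorshift column generator can be sketched as below. The post does not state which xorshift variant or seed the benchmark uses, so the classic Marsaglia xorshift64 with the (13, 7, 17) shift triple and the names `XorShift64` / `make_column` are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Marsaglia xorshift64 with the (13, 7, 17) shift triple. The state must be
// non-zero, so a zero seed is replaced by an arbitrary non-zero constant.
struct XorShift64 {
    uint64_t state;
    explicit XorShift64(uint64_t seed)
        : state(seed != 0 ? seed : 0x9E3779B97F4A7C15ULL) {}
    uint64_t next() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        return state;
    }
};

// Fills one INT64 column; the same seed always yields the same values, which
// is what lets several implementations be fed an identical sequence.
std::vector<int64_t> make_column(uint64_t seed, std::size_t rows) {
    XorShift64 rng(seed);
    std::vector<int64_t> col(rows);
    for (auto& v : col) v = static_cast<int64_t>(rng.next());
    return col;
}
```

Because the generator is a pure function of its seed, porting these three shift lines to Java or Python is enough to get the same column contents in every implementation.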
**Comparison style**: this is deliberately *not* a controlled API match; each
implementation uses its own fastest idiomatic path:
- **this branch**: bulk column write + `TsBlock` batch read + SIMD + thread pool
- **develop**: bulk column write + `TsBlock` batch read, single-thread, no SIMD
- **Java**: `Tablet` batch write + `ResultSet` block-cursor read (single-thread)
**Encoding and compression are pinned to the same settings across all three**
and reported for two configurations below. `java` and `develop` are
single-thread baselines; each `current_Nt` cell is this branch at N threads,
annotated `(×java / ×develop)`, i.e. its speedup over the two single-thread
baselines.
### 2.1 TS_2DIFF + LZ4
**Write**
| Cols | java | develop | current_1t | current_2t | current_4t | current_8t |
|-----:|-----:|--------:|-----------:|-----------:|-----------:|-----------:|
| 4 | 1.7 | 17.7 | 44.8 (25.7/2.5) | 60.1 (34.5/3.4) | 73.7 (42.3/4.2) | 72.2 (41.4/4.1) |
| 8 | 2.3 | 9.8 | 25.9 (11.4/2.7) | 36.9 (16.2/3.8) | 49.5 (21.7/5.1) | 54.1 (23.7/5.5) |
| 16 | 1.3 | 5.4 | 14.2 (10.6/2.6) | 21.1 (15.8/3.9) | 30.9 (23.2/5.8) | 34.4 (25.8/6.4) |
**Read**
| Cols | java | develop | current_1t | current_2t | current_4t | current_8t |
|-----:|-----:|--------:|-----------:|-----------:|-----------:|-----------:|
| 4 | 7.5 | 9.3 | 67.4 (9.0/7.2) | 99.8 (13.4/10.7) | 127.4 (17.1/13.7) | 144.2 (19.3/15.5) |
| 8 | 4.7 | 4.8 | 35.8 (7.7/7.5) | 44.9 (9.6/9.4) | 78.9 (16.9/16.4) | 91.8 (19.7/19.1) |
| 16 | 2.1 | 2.4 | 19.6 (9.5/8.0) | 26.6 (13.0/10.9) | 41.0 (20.0/16.7) | 46.0 (22.5/18.8) |
### 2.2 PLAIN + uncompressed (codec-isolated cross-check)
**Write**
| Cols | java | develop | current_1t | current_2t | current_4t | current_8t |
|-----:|-----:|--------:|-----------:|-----------:|-----------:|-----------:|
| 4 | 3.6 | 18.2 | 43.9 (12.2/2.4) | 60.8 (16.9/3.3) | 72.4 (20.1/4.0) | 75.0 (20.8/4.1) |
| 8 | 2.4 | 10.1 | 26.1 (10.8/2.6) | 38.1 (15.8/3.8) | 50.4 (20.9/5.0) | 52.7 (21.9/5.2) |
| 16 | 1.3 | 4.9 | 14.6 (11.6/3.0) | 22.3 (17.8/4.6) | 31.5 (25.1/6.5) | 36.6 (29.2/7.5) |
**Read**
| Cols | java | develop | current_1t | current_2t | current_4t | current_8t |
|-----:|-----:|--------:|-----------:|-----------:|-----------:|-----------:|
| 4 | 8.6 | 9.2 | 69.6 (8.1/7.6) | 88.0 (10.2/9.6) | 142.1 (16.5/15.5) | 160.2 (18.6/17.4) |
| 8 | 4.8 | 4.1 | 37.7 (7.9/9.2) | 52.3 (11.0/12.8) | 80.7 (17.0/19.7) | 81.4 (17.1/19.9) |
| 16 | 2.8 | 2.4 | 19.9 (7.1/8.1) | 27.8 (10.0/11.4) | 43.5 (15.6/17.8) | 49.5 (17.7/20.2) |
### Takeaways
- **Single-thread** (current_1t vs develop): writes ≈ **2.4–3x**, reads ≈
**7–9x** faster. The read gap is the core of this work — batch decode + SIMD +
single-pass scatter into `TsBlock`.
- **Multi-thread**: at 8 threads on 8 cores the branch reaches roughly
**2.3–2.5x** its own single-thread read throughput on wide chunks (16 cols) and
≈ **2.2–2.6x** on 8 cols; writes scale ≈ **1.6–2.5x**, flattening past 4 threads
(IO / serial page tail).
- **Encoding barely moves the C++ numbers** (PLAIN vs TS_2DIFF on random data,
which neither encodes nor compresses much): the bottleneck is decode + scatter,
not the codec, which is exactly what the batch + SIMD path targets.
- Java write is slower largely because the public `Tablet.addValue` path builds
rows cell-by-cell (vs the C++ bulk column memcpy); that is Java's idiomatic
best with the v4 table API.
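
The ratios quoted in the tables and takeaways are plain throughput quotients rounded to one decimal place. A small helper (hypothetical name `speedup`, not part of the benchmark harness) makes the arithmetic explicit:

```cpp
#include <cmath>

// Ratio of two throughputs in million rows/s, rounded to one decimal place
// the way the table annotations are.
inline double speedup(double throughput, double baseline) {
    return std::round(throughput / baseline * 10.0) / 10.0;
}
```

For example, the 4-column TS_2DIFF read gives `speedup(67.4, 9.3)` = 7.2 for `current_1t` over `develop`, and the 16-column read scaling from 1 to 8 threads is `speedup(46.0, 19.6)` = 2.3. Because the table entries are themselves rounded, recomputing a ratio from them can differ from the printed annotation in the last digit.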
### PR
The PR is at https://github.com/apache/tsfile/pull/823.
Feedback and review are very welcome, especially on:
- the batch encode/decode implementation (PLAIN / TS2DIFF / Gorilla) and the
batch null/filter handling,
- the parallelization implementation (single global worker pool, chunk-level
parallel decode, column-parallel encode),
- the SIMD / `-march=native` build switches and their fallbacks,
- the benchmark methodology.
Best regards,
Colin Li