hanahmily opened a new pull request, #1182:
URL: https://github.com/apache/skywalking-banyandb/pull/1182
## Summary
Performance work on the trace/stream query read path, the distributed benchmark used to
measure it, and a fix for the replicated-schema integration setup.
The headline change makes **point-lookup queries decode only the block metadata they need**.
Previously a query decoded *every* `blockMetadata` entry of a ~128KB primary-block granule
(building each entry's tag/column map) just to use the few entries whose key was requested.
Both the trace iterator (keyed by `traceID`) and the stream iterator (keyed by `seriesID`)
already locate the wanted entries via binary search, and `findBlock` only ever returns entries
whose key is in the queried set, so decoding the rest is wasted work. We now merge-walk the
sorted granule against the sorted query keys and fully decode only the matched entries, cheaply
skipping the others without allocating their maps. The result is byte-identical to the full
decode filtered to the queried keys (locked by equivalence unit tests); migration paths keep
the full decode.
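
The merge-walk described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `entry`, `decoded`, `decodeFull`, `decodeFiltered`, and `decodeAllThenFilter` are hypothetical names, keys are modeled as `uint64` for brevity, and skipping an entry is modeled as simply advancing an index rather than stepping over its encoded bytes.

```go
package main

// entry is a hypothetical stand-in for one encoded blockMetadata record:
// a sort key plus an opaque payload that is expensive to decode.
type entry struct {
	key     uint64
	payload []byte
}

// decoded is a hypothetical fully decoded record, including the per-entry
// map whose allocation the filtered path tries to avoid.
type decoded struct {
	key  uint64
	tags map[string]string
}

// decodeFull models the expensive step: it allocates the per-entry map.
func decodeFull(e entry) decoded {
	return decoded{key: e.key, tags: map[string]string{"payload": string(e.payload)}}
}

// decodeFiltered walks entries (sorted by key) and wanted (sorted, unique)
// in lockstep and fully decodes only entries whose key is wanted.
func decodeFiltered(entries []entry, wanted []uint64) []decoded {
	var out []decoded
	i, j := 0, 0
	for i < len(entries) && j < len(wanted) {
		switch {
		case entries[i].key < wanted[j]:
			i++ // not requested: skip without allocating a map
		case entries[i].key > wanted[j]:
			j++ // no further entries for this wanted key
		default:
			out = append(out, decodeFull(entries[i]))
			i++ // keep j in place: several entries may share one key
		}
	}
	return out
}

// decodeAllThenFilter is the reference behavior: decode every entry,
// then keep only those whose key was requested.
func decodeAllThenFilter(entries []entry, wanted []uint64) []decoded {
	set := make(map[uint64]struct{}, len(wanted))
	for _, k := range wanted {
		set[k] = struct{}{}
	}
	var out []decoded
	for _, e := range entries {
		d := decodeFull(e)
		if _, ok := set[d.key]; ok {
			out = append(out, d)
		}
	}
	return out
}
```

Both functions should return the same entries in the same order for sorted inputs, which is the property an equivalence test would check; only `decodeFiltered` avoids the map allocation for unmatched entries.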
## Changes
- **`perf(trace)`**: speeds up the `trace_by_id` path by skipping the wildcard series-index
  resolution for trace-id queries (its only consumer early-returns when `TraceIDs` is set);
  decoding the on-disk columnar block directly into per-trace results in the vectorized path
  (bypassing the `blockCursor` intermediate format and the now-removed Phase-2 operator
  pipeline); and lazily decoding primary-block metadata (`unmarshalBlockMetadataFiltered`).
- **`perf(stream)`**: applies the same lazy primary-block-metadata decode to the stream query
  path (keyed by `seriesID`). *Not* applied to measure, whose `readPrimaryBlock` caches the full
  decoded granule across queries (a filtered decode would corrupt the shared cache, and the
  cache already amortizes the cost).
- **`test(querybench)`**: distributed trace query benchmark harness (`trace_by_id`,
  `trace_tag_filter`) with a real-world cardinality matrix, Docker resource limits, CPU/heap
  profiling, and merged JSON/Markdown reports.
- **`test(replication)`**: adds the missing `sw_cross_segment*` groups to the replicated
  measure/stream/trace testdata. The replicated loaders create a curated group set and then load
  the standard resources (which include cross-segment fixtures); those groups were never
  mirrored, so the property-schema preload failed with `NotFound` in `BeforeSuite`.
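
To illustrate why the lazy decode is not applied to measure, here is a minimal sketch of how a filtered decode could poison a cache shared across queries. The names `granuleCache`, `load`, and `contains` are hypothetical and do not correspond to the actual `readPrimaryBlock` cache; decoded entries are reduced to their keys.

```go
package main

// granuleCache is a hypothetical cache shared across queries. It maps a
// primary-block id to the keys of that block's decoded entries.
type granuleCache map[int][]uint64

// load returns the cached entries for block and calls decode only on a miss.
func (c granuleCache) load(block int, decode func() []uint64) []uint64 {
	if keys, ok := c[block]; ok {
		return keys
	}
	keys := decode()
	c[block] = keys
	return keys
}

// contains reports whether key is among the decoded entries.
func contains(keys []uint64, key uint64) bool {
	for _, k := range keys {
		if k == key {
			return true
		}
	}
	return false
}
```

If a first query stores a decode filtered to its own keys, a later query for a different key hits the cache and sees an incomplete granule, even though the key exists on disk. Caching only full decodes avoids this, at the cost the cache already amortizes.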
## Benchmark (same-session A/B, Docker 4cpu/8g, 300 iters/30 warmup)
`trace_by_id`, baseline → lazy-decode:

| metric | 100k row | 100k vec | 1M row | 1M vec |
| --- | --- | --- | --- | --- |
| p50 ms | 3.05 → 1.86 | 2.96 → 2.19 | 3.18 → 2.06 | 3.65 → 2.42 |
| mallocs/query | 16944 → 6278 | 23935 → 5913 | 16761 → 5853 | 16374 → 6088 |
| QPS | 1034 → 1632 | 1095 → 1454 | 965 → 1523 | 877 → 1278 |

CPU profile: the `unmarshalBlockMetadata` frame drops from ~18–29% cum to ~0; `readPrimaryBlock`
is then just zstd-decompress + disk read. `trace_tag_filter` (broad scan) shows no regression.
The win scales with selectivity (biggest for few-key lookups, a wash for full scans) and is
shared by row and vec, since the dominant point-lookup cost is the engine-agnostic metadata
decode.
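
For reading the table as relative changes, a small helper such as the following can be used. `relChange` is a hypothetical name introduced here for illustration and is not part of the benchmark harness.

```go
package main

// relChange returns the relative change from before to after as a fraction:
// a negative value is a reduction, a positive value is an increase.
func relChange(before, after float64) float64 {
	return (after - before) / before
}
```

For example, `relChange(3.05, 1.86)` gives the change in p50 latency for the 100k row column, and `relChange(1034, 1632)` gives the corresponding change in QPS.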
## Testing
- `make check` + `make lint` (full `.golangci.yml`): clean.
- Unit: `banyand/trace`, `banyand/stream`, `pkg/query/vectorized/trace` pass. New equivalence
  tests `Test_unmarshalBlockMetadataFiltered` (trace + stream) lock the filtered-vs-full decode.
- Integration: `distributed/query` and `standalone/query` (full suites) pass; `replication`
  passes in isolation (the `BeforeSuite` preload now succeeds).