airborne12 opened a new pull request, #393:
URL: https://github.com/apache/doris-thirdparty/pull/393
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache/doris#63633 (SPIMI V4 inverted index storage format)
Problem Summary:
Four CLucene-side improvements that the Doris fulltext inverted index
writer depends on. Three reduce per-term `Posting` memory; one exposes
the RAM-used metric Doris uses for memory tracking.
#### Commits in this PR (chronological order)
1. **`[improvement](clucene) Report fulltext writer RAM usage`** —
Expose `SDocumentsWriter` buffered postings memory through
`getRAMUsed()` so Doris can account fulltext index writer memory
inside its segment memory estimate. Pure observability.
2. **`[improvement](clucene) Reduce fulltext writer posting buffer
memory`** — Tighten the `Posting` struct layout: drop unused
alignment padding and switch a small int field to a smaller width.
Cuts ~25 % off per-term posting memory on the standard fulltext
workload.
3. **`[improvement](clucene) Drop redundant Posting::textLen field`** —
`textLen` was a cached copy of the text vector's size that was
always reachable via `text.size()`. Removing the cached copy
shaves another 4 bytes per `Posting` with no algorithmic change.
4. **`[improvement](clucene) Allocate Posting position state only
when needed`** — Lazy-allocate the `Posting`'s position-tracking
members, deferring allocation until the first position is recorded.
For columns with `support_phrase=false` or a single token per doc,
this state is never touched; previously it was always allocated.
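
For reviewers who want a picture of commits 2 to 4, here is a minimal C++ sketch of the three ideas: reordering fields to avoid padding, deriving the text length from the vector, and allocating position state on first use. Every name in it (`PostingBefore`, `PostingAfter`, `PositionState`, `recordPosition`, and all field names) is hypothetical and chosen for illustration; it is not the actual CLucene `Posting` definition, and the real savings depend on the real field set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical "before" layout: a 1-byte field sits between 4-byte fields
// (forcing padding), the text length is cached, and position state is
// stored inline whether or not positions are ever recorded.
struct PostingBefore {
    std::vector<char> text;     // term bytes
    std::int32_t textLen;       // cached copy of text.size()
    std::uint8_t flags;         // followed by alignment padding
    std::int32_t docFreq;
    std::int32_t lastDocID;
    std::int32_t lastPosition;  // position state, always present
    std::int32_t proxStart;     // position state, always present
    std::int32_t proxUpto;      // position state, always present
};

// Position-tracking members grouped so they can be allocated on demand.
struct PositionState {
    std::int32_t lastPosition = 0;
    std::int32_t proxStart = 0;
    std::int32_t proxUpto = 0;
};

// Hypothetical "after" layout: widest members first, no cached length,
// position state behind a pointer that stays null until it is needed.
struct PostingAfter {
    std::vector<char> text;
    std::unique_ptr<PositionState> pos;  // null until the first position
    std::int32_t docFreq = 0;
    std::int32_t lastDocID = 0;
    std::uint8_t flags = 0;

    // The length is derived from the vector instead of being stored twice.
    std::size_t textLen() const { return text.size(); }

    // Allocates the position state on first use, then updates it.
    void recordPosition(std::int32_t position) {
        assert(position >= 0);
        if (!pos) {
            pos = std::make_unique<PositionState>();
        }
        pos->lastPosition = position;
        pos->proxUpto += 1;
    }
};
```

In this sketch a posting that never records a position pays only for a null pointer. One that does record positions pays for the pointer plus a heap allocation, so the trade-off favours workloads where position state is usually unused.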
#### Why these are stacked together
The four optimisations were developed against the same `SDocumentsWriter`
critical path. Each one builds on the previous one's layout change, so
splitting them across PRs would force repeated conflict resolution on
the same lines. Each commit compiles and passes the Doris BE unit tests
on its own (`InvertedIndexWriterTest.FullTextStringMemoryEstimate*`
covers the API surface touched).
#### Downstream Doris PR
This PR must merge **before** apache/doris#63633 can reference the
correct clucene submodule SHA. The Doris PR's submodule pointer will
be updated to the tip of `clucene` after this lands.
### Release note
None — internal CLucene memory optimisations, no behaviour change
on the Lucene 2.x on-disk format. Doris's memory tracking will
report tighter peaks on fulltext columns once the downstream PR
merges.
### Check List (For Author)
- Test:
- [x] Unit Test (covered by Doris BE UT
`InvertedIndexWriterTest.FullTextStringMemoryEstimateIncludesBufferedPostings`
and friends — exercised in the downstream Doris PR's CI)
- Behavior changed:
- [ ] No.
- [x] Yes — `IndexWriter::getRAMUsed()` now includes
`SDocumentsWriter` buffered postings memory (was 0 before).
- Does this need documentation:
- [x] No.
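
To illustrate the behaviour change noted above, the following is a rough C++ sketch of how a writer facade might forward buffered postings memory into a caller's segment estimate. The classes `BufferedPostingsWriter` and `IndexWriterSketch`, and the helper `estimateSegmentMemory`, are hypothetical stand-ins; only the method name `getRAMUsed()` comes from this PR, and the real accounting inside `SDocumentsWriter` may differ.

```cpp
#include <cstddef>

// Hypothetical stand-in for the component that buffers postings in memory.
class BufferedPostingsWriter {
public:
    // Record that a posting of the given size was buffered.
    void addPosting(std::size_t bytes) { bufferedBytes_ += bytes; }
    // Flushing writes postings out and releases the buffer.
    void flush() { bufferedBytes_ = 0; }
    std::size_t getRAMUsed() const { return bufferedBytes_; }

private:
    std::size_t bufferedBytes_ = 0;
};

// Hypothetical facade: reports the buffered postings memory instead of 0.
class IndexWriterSketch {
public:
    BufferedPostingsWriter& docWriter() { return docWriter_; }
    std::size_t getRAMUsed() const { return docWriter_.getRAMUsed(); }

private:
    BufferedPostingsWriter docWriter_;
};

// Caller-side estimate: other segment memory plus the writer's buffer.
std::size_t estimateSegmentMemory(std::size_t otherBytes,
                                  const IndexWriterSketch& writer) {
    return otherBytes + writer.getRAMUsed();
}
```

With a facade like this, a caller that previously saw 0 from the writer would see the estimate grow as postings are buffered and drop back after a flush.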
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]