airborne12 opened a new pull request, #393:
URL: https://github.com/apache/doris-thirdparty/pull/393

   ### What problem does this PR solve?
   
   Issue Number: close #xxx
   
   Related PR: apache/doris#63633 (SPIMI V4 inverted index storage format)
   
   Problem Summary:
   
   Four CLucene-side improvements that the Doris fulltext inverted index
   writer depends on. Three reduce per-token Posting memory; one exposes
   the RAM-used metric Doris uses for memory tracking.
   
   #### Commits in this PR (chronological order)
   
   1. **`[improvement](clucene) Report fulltext writer RAM usage`** —
      Expose `SDocumentsWriter` buffered postings memory through
      `getRAMUsed()` so Doris can account fulltext index writer memory
      inside its segment memory estimate. Pure observability.
   
   2. **`[improvement](clucene) Reduce fulltext writer posting buffer
      memory`** — Tighten the `Posting` struct layout: drop unused
      alignment padding and switch a small int field to a smaller width.
      Cuts ~25 % off per-term posting memory on the standard fulltext
      workload.
   
   3. **`[improvement](clucene) Drop redundant Posting::textLen field`** —
      `textLen` was a cached copy of the text vector's size that was
      always reachable via `text.size()`. Removing the cached copy
      shaves another 4 bytes per `Posting` with no algorithmic change.
   
   4. **`[improvement](clucene) Allocate Posting position state only
      when needed`** — Defer allocating the `Posting`'s position-tracking
      members until the first position is recorded. For columns with
      `support_phrase=false` or a single token per doc, this state is
      never touched; previously it was always allocated.
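
As a rough illustration of commits 3 and 4, the sketch below contrasts an eager layout with a lazy one. It is not the actual CLucene `Posting` definition: every type, field, and helper name here (`EagerPosting`, `LazyPosting`, `PositionState`, `bytesUsed`) is hypothetical, and the field set is an assumption chosen only to show the pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical "before" shape: a cached length plus position state that is
// always present, whether or not positions are ever recorded.
struct EagerPosting {
    std::vector<char> text;
    std::uint32_t textLen = 0;      // duplicate of text.size()
    std::int32_t docFreq = 0;
    std::int32_t lastDocID = 0;
    std::int32_t lastPosition = 0;  // paid for even if never used
    std::int32_t positionCount = 0;
};

// Hypothetical position-tracking state, split out so it can be allocated late.
struct PositionState {
    std::int32_t lastPosition = 0;
    std::int32_t positionCount = 0;
};

// Hypothetical "after" shape: no cached length, position state on demand.
struct LazyPosting {
    std::vector<char> text;
    std::int32_t docFreq = 0;
    std::int32_t lastDocID = 0;
    std::unique_ptr<PositionState> positions;  // null until the first position

    // The length is derived from the vector instead of being stored twice.
    std::size_t textLen() const { return text.size(); }

    void recordPosition(std::int32_t position) {
        assert(position >= 0);
        if (!positions) {
            positions = std::make_unique<PositionState>();  // first use only
        }
        positions->lastPosition = position;
        ++positions->positionCount;
    }
};

// Bytes attributable to one posting, counting the heap block only if it exists.
// The text buffer is left out to keep the comparison focused on fixed overhead.
inline std::size_t bytesUsed(const LazyPosting& p) {
    return sizeof(LazyPosting) + (p.positions ? sizeof(PositionState) : 0);
}
```

A posting that never calls `recordPosition` costs only `sizeof(LazyPosting)`; the trade-off is one extra pointer per posting and a heap allocation for the postings that do record positions.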
   
   #### Why these are stacked together
   
   The four optimisations were developed against the same `SDocumentsWriter`
   critical path. Each one builds on the previous one's layout change, so
   splitting them across PRs would force repeated conflict resolution on
   the same lines. Each commit compiles and passes the Doris BE unit tests
   on its own (`InvertedIndexWriterTest.FullTextStringMemoryEstimate*`
   covers the API surface touched).
   
   #### Downstream Doris PR
   
   This PR must merge **before** apache/doris#63633 can reference the
   correct clucene submodule SHA. The Doris PR's submodule pointer will
   be updated to the tip of `clucene` after this lands.
   
   ### Release note
   
   None — internal CLucene memory optimisations, no behaviour change
   on the Lucene 2.x on-disk format. Doris's memory tracking will
   report tighter peaks on fulltext columns once the downstream PR
   merges.
   
   ### Check List (For Author)
   
   - Test:
       - [x] Unit Test (covered by Doris BE UT
         `InvertedIndexWriterTest.FullTextStringMemoryEstimateIncludesBufferedPostings`
         and friends, exercised in the downstream Doris PR's CI)
   - Behavior changed:
       - [ ] No.
       - [x] Yes — `IndexWriter::getRAMUsed()` now includes
         `SDocumentsWriter` buffered postings memory (was 0 before).
   - Does this need documentation:
       - [x] No.
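
For reviewers who want a picture of the `getRAMUsed()` behaviour change, here is a minimal sketch of the accounting pattern. It uses stand-in classes rather than the real CLucene types: `ToyDocumentsWriter`, `ToyIndexWriter`, `estimateSegmentMemory`, and the 32-byte per-posting constant are all assumptions made for illustration, and only the method name `getRAMUsed()` mirrors this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Stand-in for a documents writer that buffers postings in memory.
class ToyDocumentsWriter {
public:
    void addToken(const std::string& term) {
        assert(!term.empty());
        auto result = postings_.emplace(term, 0u);
        if (result.second) {
            // New term: charge a fixed posting overhead plus the term bytes.
            bufferedBytes_ += kPostingBytes + term.size();
        }
        ++result.first->second;  // bump the term frequency
    }

    // Buffered postings memory, tracked incrementally as tokens arrive.
    std::size_t getRAMUsed() const { return bufferedBytes_; }

private:
    static constexpr std::size_t kPostingBytes = 32;  // made-up figure
    std::unordered_map<std::string, std::uint32_t> postings_;
    std::size_t bufferedBytes_ = 0;
};

// Stand-in for the index writer: forwards the buffered figure instead of
// reporting zero.
class ToyIndexWriter {
public:
    ToyDocumentsWriter& docWriter() { return docWriter_; }
    std::size_t getRAMUsed() const { return docWriter_.getRAMUsed(); }

private:
    ToyDocumentsWriter docWriter_;
};

// Hypothetical caller-side estimate: column data plus index writer memory.
inline std::size_t estimateSegmentMemory(std::size_t columnDataBytes,
                                         const ToyIndexWriter& writer) {
    return columnDataBytes + writer.getRAMUsed();
}
```

With this shape, a caller that previously saw zero from the index writer starts seeing the buffered postings in its segment estimate, which is the observable change flagged in the checklist above.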
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

