airborne12 opened a new pull request, #63651:
URL: https://github.com/apache/doris/pull/63651
### What problem does this PR solve?
Issue Number: close #xxx
Related PRs:
- apache/doris-thirdparty#393 — clucene-side: expose
`SDocumentsWriter` buffered postings via `getRAMUsed()`. Must merge
before this PR (this PR bumps `contrib/clucene` to that commit).
- apache/doris#63633 — downstream consumer (SPIMI V4 inverted index
storage format). Sits on top of this PR.
Problem Summary:
Tighten BE memory tracking and capping for analyzed inverted index
writers (fulltext, Chinese tokenization, etc.). Today the BE's
`IndexColumnWriter::size()` returns 0 for inverted index writers, so
the segment memory estimate in `segment_writer.cpp` misses the CLucene
buffered postings entirely, even though they are by far the largest
source of resident bytes during a fulltext flush. Under tight memory
pressure (cloud BE with `inverted_index_ram_dir_enable=false`), the
segment builder can overshoot its memory budget and OOM the BE.
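The accounting gap described above can be illustrated with a minimal sketch. Every type and function name here (`IndexWriterSketch`, `UntrackedFulltextWriter`, `TrackedFulltextWriter`, `estimate_segment_bytes`) is hypothetical and is not the Doris code; the sketch only assumes that the segment estimate adds up whatever each index writer reports through `size()`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an index column writer (not the Doris class).
struct IndexWriterSketch {
    virtual ~IndexWriterSketch() = default;
    // Bytes the writer reports to the segment-level estimate.
    virtual int64_t size() const = 0;
};

// Models the behaviour described above: postings are resident in memory,
// but size() reports 0, so the estimate never sees them.
struct UntrackedFulltextWriter : IndexWriterSketch {
    int64_t buffered_postings_bytes = 0;
    int64_t size() const override { return 0; }
};

// Models a writer that reports its buffered postings.
struct TrackedFulltextWriter : IndexWriterSketch {
    int64_t buffered_postings_bytes = 0;
    int64_t size() const override { return buffered_postings_bytes; }
};

// Assumed shape of the segment estimate: column data plus reported index bytes.
inline int64_t estimate_segment_bytes(
        int64_t column_data_bytes,
        const std::vector<const IndexWriterSketch*>& writers) {
    int64_t total = column_data_bytes;
    for (const IndexWriterSketch* writer : writers) {
        total += writer->size();
    }
    return total;
}
```

With identical buffered postings, the untracked writer leaves the estimate at the column data size alone, while the tracked writer raises it by the postings size. That difference is the amount by which the segment builder could exceed its budget without noticing.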
Three commits, smallest-first:
1. **`[improvement](be) Track inverted index writer memory estimate`** —
Plumb a real `size()` reporter through the writer hierarchy. No
behaviour change yet; just the scaffolding the next two commits
plug values into.
2. **`[improvement](be) Account fulltext inverted index writer memory`** —
Implement `size()` for the fulltext path: report
`_null_bitmap.getSizeInBytes()` + `index_writer->ramSizeInBytes()`
(via the CLucene `getRAMUsed()` exposed by
doris-thirdparty#393) + `ram_directory_memory_size()`. The segment
memory estimate now sees fulltext writer memory.
3. **`[improvement](be) Cap fulltext index memory without ram dir`** —
New config `inverted_index_ram_buffer_size_when_ram_dir_disabled`
(default 64 MB). When the column has an analyzer AND
`is_fs_directory(_dir)` (i.e. `inverted_index_ram_dir_enable=false`),
clamp CLucene's `setRAMBufferSizeMB()` down from the global
`inverted_index_ram_buffer_size = 512` MB to this new cap. This forces
more frequent segment flushes in exchange for a lower per-column
RAM peak, which is the trade-off cloud users want when RAM dir is
disabled.
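The size reporting in commit 2 and the clamp in commit 3 can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `RamBufferConfigSketch`, `effective_ram_buffer_mb`, and `fulltext_writer_size_bytes` are hypothetical names, and taking the minimum of the two configs is an assumed reading of "clamp down" (the real code may treat zero or negative values differently).

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical mirror of the two configs named above, both in MB.
struct RamBufferConfigSketch {
    int64_t ram_buffer_size_mb = 512;                   // inverted_index_ram_buffer_size
    int64_t ram_buffer_size_when_ram_dir_disabled_mb = 64;  // the new cap
};

// Value that would be handed to CLucene's setRAMBufferSizeMB().
// Assumption: the cap applies only to analyzed columns writing to an
// FS directory, and never raises the global setting.
inline int64_t effective_ram_buffer_mb(const RamBufferConfigSketch& cfg,
                                       bool has_analyzer,
                                       bool is_fs_directory) {
    if (has_analyzer && is_fs_directory) {
        return std::min(cfg.ram_buffer_size_mb,
                        cfg.ram_buffer_size_when_ram_dir_disabled_mb);
    }
    return cfg.ram_buffer_size_mb;
}

// Commit 2's fulltext estimate as a plain sum of the three components:
// null bitmap + CLucene buffered postings + RAM directory contents.
inline int64_t fulltext_writer_size_bytes(int64_t null_bitmap_bytes,
                                          int64_t buffered_postings_bytes,
                                          int64_t ram_directory_bytes) {
    return null_bitmap_bytes + buffered_postings_bytes + ram_directory_bytes;
}
```

Under these assumptions an analyzed column on an FS directory gets a 64 MB buffer with the defaults, while non-analyzed columns and RAM-directory writers keep the global 512 MB.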
#### Why a precursor PR
These three commits are independently useful: they add memory
tracking and capping to the existing V1/V2/V3 CLucene writer path.
They were originally bundled in apache/doris#63633 (the V4 SPIMI PR)
as prerequisites; lifting them out makes the V4 PR smaller and lets
the cloud teams pick up the memory tracking improvements without
waiting for the larger SPIMI refactor to land.
### Release note
Add inverted index writer memory tracking and a new cap config
`inverted_index_ram_buffer_size_when_ram_dir_disabled` (default
64 MB) that limits the CLucene buffered postings size for analyzed
columns when RAM directory is disabled. Reduces BE OOM risk under
fulltext-heavy concurrent loads in cloud mode.
### Check List (For Author)
- Test:
    - [x] Unit Test
      (`InvertedIndexWriterTest.FullTextStringMemoryEstimateIncludesBufferedPostings`
      + `*RamBufferCap*` tests)
    - [ ] Manual test
- Behavior changed:
    - [ ] No.
    - [x] Yes: adds a new config that caps CLucene
      `setRAMBufferSizeMB` when RAM dir is disabled. Default 64 MB.
      Tunable at runtime (mutable config).
- Does this need documentation:
    - [x] Yes: config will appear in admin doc; doc PR will follow.