airborne12 opened a new pull request, #63651:
URL: https://github.com/apache/doris/pull/63651

   ### What problem does this PR solve?
   
   Issue Number: close #xxx
   
   Related PRs:
   - apache/doris-thirdparty#393 — clucene-side: expose
     `SDocumentsWriter` buffered postings via `getRAMUsed()`. Must merge
     before this PR (this PR bumps `contrib/clucene` to that commit).
   - apache/doris#63633 — downstream consumer (SPIMI V4 inverted index
     storage format). Sits on top of this PR.
   
   Problem Summary:
   
   Tighten BE memory tracking and capping for analyzed inverted index
   writers (fulltext + Chinese tokenization etc.). Today the BE's
   `IndexColumnWriter::size()` returns 0 for inverted index writers, so
   `segment_writer.cpp`'s segment memory estimate misses the CLucene
   buffered postings entirely — by far the largest source of resident
   bytes during a fulltext flush. Under tight memory pressure (cloud BE
   with `inverted_index_ram_dir_enable=false`), this causes the
   segment-builder to overshoot the memory budget and OOM the BE.
   
   Three commits, smallest-first:
   
   1. **`[improvement](be) Track inverted index writer memory estimate`** —
      Plumb a real `size()` reporter through the writer hierarchy. No
      behaviour change yet; just the scaffolding the next two commits
      plug values into.
   
   2. **`[improvement](be) Account fulltext inverted index writer memory`** —
      Implement `size()` for the fulltext path: report
      `_null_bitmap.getSizeInBytes()` + `index_writer->ramSizeInBytes()`
      (via the clucene `getRAMUsed()` exposed by
      doris-thirdparty#393) + `ram_directory_memory_size()`. The segment
      memory estimate now sees fulltext writer memory.
   
   3. **`[improvement](be) Cap fulltext index memory without ram dir`** —
      New config `inverted_index_ram_buffer_size_when_ram_dir_disabled`
      (default 64 MB). When the column has an analyzer AND
      `is_fs_directory(_dir)` (i.e. `inverted_index_ram_dir_enable=false`),
      clamp CLucene's `setRAMBufferSizeMB()` down from the global
      `inverted_index_ram_buffer_size = 512` MB to this new cap. Forces
      more frequent segment flushes in exchange for tighter per-column
      RAM peak — the trade-off cloud users want when RAM dir is
      disabled.
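
   The accounting in commit 2 and the clamp in commit 3 can be sketched
   roughly as below. This is an illustration only, not the code in this
   PR: `FullTextWriterUsage`, `RamBufferConfig`, `estimated_size()` and
   `effective_ram_buffer_mb()` are hypothetical names, and using
   `std::min` (so a cap larger than the global value leaves the global
   value in place) is an assumption about how "clamp down" behaves.

   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <cstdint>

   // Hypothetical byte counters standing in for the three terms that
   // commit 2 adds together.
   struct FullTextWriterUsage {
       int64_t null_bitmap_bytes = 0;        // _null_bitmap.getSizeInBytes()
       int64_t buffered_postings_bytes = 0;  // index_writer->ramSizeInBytes()
       int64_t ram_directory_bytes = 0;      // ram_directory_memory_size()
   };

   // What a fulltext writer's size() would report under this sketch.
   inline int64_t estimated_size(const FullTextWriterUsage& usage) {
       return usage.null_bitmap_bytes + usage.buffered_postings_bytes +
              usage.ram_directory_bytes;
   }

   // Hypothetical stand-ins for the two BE configs, both in MB.
   struct RamBufferConfig {
       int64_t global_mb = 512;  // inverted_index_ram_buffer_size
       int64_t capped_mb = 64;   // inverted_index_ram_buffer_size_when_ram_dir_disabled
   };

   // Value that would be handed to CLucene's setRAMBufferSizeMB().
   // The cap applies only to analyzed columns writing to an FS directory.
   inline int64_t effective_ram_buffer_mb(const RamBufferConfig& cfg,
                                          bool has_analyzer,
                                          bool is_fs_directory) {
       if (has_analyzer && is_fs_directory) {
           // Assumption: never raise the buffer above the global setting.
           return std::min(cfg.global_mb, cfg.capped_mb);
       }
       return cfg.global_mb;
   }
   ```

   With the defaults shown, an analyzed column on an FS directory would
   get a 64 MB buffer, while non-analyzed columns and RAM-directory
   writers keep the 512 MB global value.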
   
   #### Why a precursor PR
   
   These three commits are independently useful: they add memory
   tracking and capping to the existing V1/V2/V3 CLucene writer path.
   They were originally bundled into apache/doris#63633 (the V4 SPIMI
   PR) as prerequisites. Lifting them out makes the V4 PR smaller and
   lets the cloud teams pick up the memory tracking improvements
   without waiting for the larger SPIMI refactor to land.
   
   ### Release note
   
   Add inverted index writer memory tracking and a new cap config
   `inverted_index_ram_buffer_size_when_ram_dir_disabled` (default
   64 MB) that limits the CLucene buffered postings size for analyzed
   columns when RAM directory is disabled. Reduces BE OOM risk under
   fulltext-heavy concurrent loads in cloud mode.
   
   ### Check List (For Author)
   
   - Test:
       - [x] Unit Test
         (`InvertedIndexWriterTest.FullTextStringMemoryEstimateIncludesBufferedPostings`
         + `*RamBufferCap*` tests)
       - [ ] Manual test
   - Behavior changed:
       - [ ] No.
       - [x] Yes — adds a new config that caps CLucene
         `setRAMBufferSizeMB` when RAM dir is disabled. Default 64 MB.
         Tunable at runtime (mutable config).
   - Does this need documentation:
       - [x] Yes — config will appear in admin doc; doc PR will follow.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

