Re: [PR] Fix undercounting of RAM used by vectors buffered in in-memory segments [lucene]

via GitHub Wed, 20 May 2026 02:00:42 -0700


iprithv commented on PR #15982:
URL: https://github.com/apache/lucene/pull/15982#issuecomment-4496477009


   @shubhamvishu ah okay, got it now. thanks for the clarification!
   
   yeah, makes sense that flush behavior changes here, especially for byte 
vectors since they weren't really counted before. the fix makes them visible 
to RAM accounting, so flushes happen at the right time instead of being 
delayed until a doc count / time trigger kicks in.
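
   rough sketch of why the accounting matters. this is a toy model, not lucene's actual flush code: `simulate`, the 1 MiB budget and the doc counts are all made up for illustration, and the only rule modeled is "flush once the *accounted* RAM reaches the budget".

```python
def simulate(num_docs, true_bytes_per_doc, accounted_bytes_per_doc, ram_budget_bytes):
    """Toy model (hypothetical, not Lucene code): flush when accounted RAM hits the budget.

    Returns (flush_count, peak_true_ram_bytes).
    """
    accounted = 0   # what the RAM accounting believes is buffered
    true_ram = 0    # what is actually buffered
    peak = 0
    flushes = 0
    for _ in range(num_docs):
        accounted += accounted_bytes_per_doc
        true_ram += true_bytes_per_doc
        peak = max(peak, true_ram)
        if accounted >= ram_budget_bytes:
            # A RAM-triggered flush empties the in-memory buffer.
            flushes += 1
            accounted = 0
            true_ram = 0
    return flushes, peak


# Assumed numbers: 1024-dim byte vectors (1 byte per dimension), 1 MiB budget.
DIM = 1024
BUDGET = 1 << 20

undercounted = simulate(10_000, DIM, 0, BUDGET)   # vectors invisible to accounting
counted = simulate(10_000, DIM, DIM, BUDGET)      # vectors fully counted

print("undercounted:", undercounted)
print("counted:     ", counted)
```

   in the undercounted run the RAM trigger never fires, so the real buffer keeps growing until some other trigger steps in. in the counted run the buffer is capped at the budget, at the cost of more flushes.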
   
   ran benchmarks at different scales:
   
   float32 vectors (1M cohere 1024d docs, hnsw, 8 threads)
   
   | branch | index time | final segments |
   |--------|------------|----------------|
   | main   | 179.26 sec | 9 segments     |
   | fix    | 181.84 sec | 3 segments     |
   
   fix is ~1.4% slower, and ends up with fewer segments (3 vs 9)
   
   
   byte vectors (index time)
   
   | branch | 100k docs | 1M docs    |
   |--------|-----------|------------|
   | main   | 15.76 sec | 180.46 sec |
   | fix    | 14.46 sec | 185.01 sec |
   
   at 100k, fix is ~8% faster (better flush timing). at 1M, fix is ~2.5% 
slower, which lines up with what you mentioned: more frequent flushes add some 
overhead at scale.
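
   for anyone checking the percentages, they are plain relative differences against main, using the timings from the two tables above (`pct_change` is just a helper name for this sketch):

```python
def pct_change(baseline, candidate):
    """Relative change of candidate vs. baseline, in percent (positive = slower)."""
    return (candidate - baseline) / baseline * 100.0


# Index times in seconds, copied from the tables above (main vs. fix).
float_1m = pct_change(179.26, 181.84)
byte_100k = pct_change(15.76, 14.46)
byte_1m = pct_change(180.46, 185.01)

print(f"float32, 1M docs: {float_1m:+.1f}%")
print(f"byte, 100k docs:  {byte_100k:+.1f}%")
print(f"byte, 1M docs:    {byte_1m:+.1f}%")
```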
   
   overall: float case is roughly the same. byte case has a small overhead at 
larger scale, but this is a correctness fix (byte vectors were not accounted 
for at all before).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


