iprithv commented on PR #15982: URL: https://github.com/apache/lucene/pull/15982#issuecomment-4496477009
@shubhamvishu ah okay, got it now. thanks for the clarification! yeah makes sense that flush behavior changes here, especially for byte vectors since they weren't really counted before. fixing this makes them visible to RAM accounting, so flushes happen at the right time instead of being delayed by doc count / time triggers.

ran benchmarks at different scales:

float32 vectors (1M cohere 1024d docs, hnsw, 8 threads)

| branch | index time | final segments |
|--------|------------|----------------|
| main | 179.26 sec | 9 segments |
| fix | 181.84 sec | 3 segments |

~1.4% difference, and fix ends up with fewer segments (3 vs 9)

byte vectors

| branch | 100k docs | 1M docs |
|--------|-----------|---------|
| main | 15.76 sec | 180.46 sec |
| fix | 14.46 sec | 185.01 sec |

at 100k, fix is ~8% faster (better flush timing). at 1M, fix is ~2.5% slower, which lines up with what you mentioned, more frequent flushes add some overhead at scale.

overall: float case is roughly the same, byte case has a small overhead at larger scale but this is fixing correctness (they were not accounted at all before)
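for anyone who wants to sanity-check the percentages quoted above, here is a minimal sketch in python. the helper name `pct_change` and the `runs` dict are made up for illustration and are not part of the PR or of the benchmark tooling; the timings are just the numbers reported in this comment.

```python
def pct_change(main_sec: float, fix_sec: float) -> float:
    """Relative change of fix vs main, in percent (positive means fix is slower)."""
    return (fix_sec - main_sec) / main_sec * 100.0


# (main, fix) index times in seconds, copied from the numbers reported above
runs = {
    "float32, 1M docs": (179.26, 181.84),
    "byte, 100k docs": (15.76, 14.46),
    "byte, 1M docs": (180.46, 185.01),
}

for name, (main_sec, fix_sec) in runs.items():
    print(f"{name}: {pct_change(main_sec, fix_sec):+.1f}%")

# prints:
# float32, 1M docs: +1.4%
# byte, 100k docs: -8.2%
# byte, 1M docs: +2.5%
```

the sign convention here is "fix relative to main", so a negative value means the fix branch indexed faster.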
