sam-herman commented on issue #15420: URL: https://github.com/apache/lucene/issues/15420#issuecomment-3530905602
> Striking the right balance between write amplification and merging costs is key. While I think the typical tier-merging system might not work best for graph structures, it doesn't mean we should abandon it and attempt to merge in teeny segments into a single giant structure. Finding an appropriate balance is what you need.

Thanks for the feedback. To clarify, the goal of this RFC is not to encourage more frequent merges or to replace Lucene’s tiered merge policy.

The reason for exposing a `RandomAccessWriter`-like primitive is to enable a path toward optional in-place updatable graph segments, which would significantly improve the handling of streaming inserts and deletes for vector/graph fields. It comes from real-world workloads similar to those we outline in the OpenSearch JVector proposal (https://github.com/opensearch-project/opensearch-jvector/issues/169).

What we’re proposing is essentially a staged, opt-in evolution of the codec and segment APIs, not a change to Lucene’s core merge strategy:

**Phase 1**: Leading-segment incremental merging

The initial step only enables the largest graph segment to incrementally append new nodes, reducing the cost of rebuilding the full graph structure on every merge. This doesn’t require changing merge frequency; it simply avoids unnecessary rewrite work in graph-heavy fields.

**Phase 2**: Per-field segment policies

Similar to today’s per-field formats, this phase would allow a field (such as a graph/vector field) to opt into a different segment-lifespan policy without affecting the rest of the index. This isolates the approach and keeps Lucene’s tiered merging intact for all other structures.

**Phase 3**: Optional in-place updatable graph segments

This is the long-term goal: providing an interface that allows graph-based fields to apply small deltas in place and persist them without requiring a full segment rewrite.
Our own on-disk representations already support this pattern efficiently, and exposing these lower-level primitives in Lucene would allow experimentation, both within Lucene and by external codecs, without committing Lucene to a particular implementation.
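
To make the Phase 3 idea more concrete, here is a minimal sketch of what a `RandomAccessWriter`-like primitive could look like. It is purely illustrative. The interface shape, the `writeAt` method, the fixed-size neighbor record layout, and all class names are assumptions made for this example, not existing Lucene or JVector APIs. It also uses a plain `java.nio.channels.FileChannel` rather than Lucene's `Directory` abstraction.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class InPlaceGraphSketch {
  /** Hypothetical positional-write primitive; NOT an existing Lucene API. */
  interface RandomAccessWriter extends AutoCloseable {
    void writeAt(long position, ByteBuffer src) throws IOException;
    void sync() throws IOException;
    @Override
    void close() throws IOException;
  }

  /** Assumed implementation backed by a plain FileChannel. */
  static final class FileChannelWriter implements RandomAccessWriter {
    private final FileChannel channel;

    FileChannelWriter(Path path) throws IOException {
      channel = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    @Override
    public void writeAt(long position, ByteBuffer src) throws IOException {
      // Positional writes may be partial, so loop until the buffer is drained.
      while (src.hasRemaining()) {
        position += channel.write(src, position);
      }
    }

    @Override
    public void sync() throws IOException {
      channel.force(true); // flush data and metadata to storage
    }

    @Override
    public void close() throws IOException {
      channel.close();
    }
  }

  // Assumed layout: each node owns a fixed-size record of neighbor ids, padded with -1.
  static final int MAX_NEIGHBORS = 4;
  static final int RECORD_BYTES = Integer.BYTES * MAX_NEIGHBORS;

  static void writeNeighbors(RandomAccessWriter out, int node, int... neighbors)
      throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(RECORD_BYTES);
    for (int i = 0; i < MAX_NEIGHBORS; i++) {
      buf.putInt(i < neighbors.length ? neighbors[i] : -1);
    }
    buf.flip();
    out.writeAt((long) node * RECORD_BYTES, buf);
  }

  static int[] readNeighbors(Path path, int node) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(RECORD_BYTES);
    try (FileChannel in = FileChannel.open(path, StandardOpenOption.READ)) {
      long start = (long) node * RECORD_BYTES;
      while (buf.hasRemaining()) {
        if (in.read(buf, start + buf.position()) < 0) {
          throw new IOException("record truncated");
        }
      }
    }
    buf.flip();
    int[] result = new int[MAX_NEIGHBORS];
    for (int i = 0; i < result.length; i++) {
      result[i] = buf.getInt();
    }
    return result;
  }

  /** Builds a three-node graph, then patches node 1's record without rewriting the file. */
  static int[] demo() {
    try {
      Path file = Files.createTempFile("graph", ".bin");
      try (RandomAccessWriter out = new FileChannelWriter(file)) {
        writeNeighbors(out, 0, 1, 2);
        writeNeighbors(out, 1, 0);
        writeNeighbors(out, 2, 0);
        writeNeighbors(out, 1, 0, 2); // small delta applied in place
        out.sync();
      }
      try {
        return readNeighbors(file, 1);
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch, `InPlaceGraphSketch.demo()` returns node 1's record after the overwrite, `[0, 2, -1, -1]`. Only the 16 bytes of that record are touched. A real design would also need to address concurrent readers, checksums, and crash consistency, which this example deliberately ignores.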
