Re: [I] [RFC] Add Random Access Write Support to IndexOutput [lucene]

via GitHub Wed, 12 Nov 2025 20:09:53 -0800


sam-herman commented on issue #15420:
URL: https://github.com/apache/lucene/issues/15420#issuecomment-3525220399

> Hi, I strongly disagree with this. We had random access in the early times
> of Apache Lucene and luckily removed it in Lucene 4.0 for good reasons, most of
> the details about why were already discussed here.

@uschindler, use case 1 above
([https://github.com/datastax/jvector/pull/542](https://github.com/datastax/jvector/pull/542))
shows a clear, quantitative horizontal improvement in bandwidth for parallel
writes on NVMe and SSDs.
Can you share the reasoning for why it was removed from earlier versions? Were
these benefits considered back then? It would be great to have a data-driven
discussion around those aspects, so we can weigh the pros and cons beyond
idiomatic/syntactic preferences.
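
To make "parallel writes" concrete, here is a minimal sketch using only the JDK's `FileChannel` positional write. It is not the code from the jvector PR and it is not a Lucene API (today's `IndexOutput` is append-only, which is the point of this RFC). The class name, the helper names, and the fixed-size chunk layout are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: not a Lucene or jvector class.
class ParallelPositionalWrite {

    /** Writes chunk i at byte offset i * chunkSize, one task per chunk. */
    static void writeChunks(Path file, byte[][] chunks, int chunkSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < chunks.length; i++) {
                final long offset = (long) i * chunkSize;
                final ByteBuffer buf = ByteBuffer.wrap(chunks[i]);
                pending.add(pool.submit(() -> {
                    long pos = offset;
                    // Positional writes do not move the channel's own position,
                    // so tasks writing disjoint ranges do not interfere.
                    while (buf.hasRemaining()) {
                        pos += ch.write(buf, pos);
                    }
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // propagate any write failure before closing the channel
            }
        } finally {
            pool.shutdown();
        }
    }

    /** Demo helper: writes the chunks to a temp file and returns the file contents. */
    static byte[] roundTrip(byte[][] chunks, int chunkSize, int threads) {
        try {
            Path tmp = Files.createTempFile("positional-write", ".bin");
            try {
                writeChunks(tmp, chunks, chunkSize, threads);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[][] chunks = { {1, 1, 1, 1}, {2, 2, 2, 2}, {3, 3, 3, 3}, {4, 4, 4, 4} };
        System.out.println(Arrays.toString(roundTrip(chunks, 4, 4)));
    }
}
```

The sketch only shows the access pattern (independent writers targeting disjoint offsets of one file). It says nothing about the achievable bandwidth, which depends on the device and is what the benchmark in the linked PR measures.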

> The second use case is an anti-pattern in Lucene and breaks the whole
> transactional behaviour, write once is a must and not uncommon.

I think we can still preserve the transactional contract; however, the
underlying implementation will be different. The idea is to make
changes in place within indices to reduce the IO footprint.
In use case 2
(https://github.com/opensearch-project/opensearch-jvector/issues/169) for
example:

The performance boost from incremental insertion to a graph as opposed to
merges is substantial:

https://github.com/opensearch-project/opensearch-jvector/blob/main/merge_times_comparison.png?raw=true

The issue, however, is that even when choosing a leading segment and
optimizing merges, there is still an IO penalty for the merge.
Imagine you have a large segment of 1 billion vectors and you want to
add 1 more. A force merge in that scenario will require you to persist about 6 TB
of data back to disk.
Mutable indices can alleviate this problem by reducing the IO footprint
to the minimal delta required to add to the graph.
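
As a back-of-the-envelope sketch of that ratio: the 6 TB and 1 billion figures come from the example above, while the class name, the decimal reading of "TB", and especially the number of graph nodes an in-place insert would touch are assumptions chosen only for illustration.

```java
// Hypothetical illustration only: rough arithmetic, not a measurement.
class MergeWriteCost {

    /** Average bytes persisted per vector when the whole segment is rewritten. */
    static long bytesPerVector(long segmentBytes, long vectorCount) {
        return segmentBytes / vectorCount;
    }

    /** Ratio of bytes rewritten by a full merge to bytes touched by an in-place update. */
    static double writeAmplification(long segmentBytes, long deltaBytes) {
        return (double) segmentBytes / deltaBytes;
    }

    public static void main(String[] args) {
        long segmentBytes = 6_000_000_000_000L; // ~6 TB, read as decimal terabytes
        long vectorCount = 1_000_000_000L;      // 1 billion vectors
        long perVector = bytesPerVector(segmentBytes, vectorCount);

        // ASSUMPTION: an in-place insert rewrites the new node plus 32 neighbours.
        int touchedNodes = 33;
        long deltaBytes = perVector * touchedNodes;

        System.out.println("bytes per vector: " + perVector);
        System.out.println("delta bytes:      " + deltaBytes);
        System.out.println("amplification:    " + writeAmplification(segmentBytes, deltaBytes));
    }
}
```

Whatever the real number of touched nodes turns out to be, the merge cost grows with the size of the segment, while the in-place delta grows only with the number of nodes the insert touches.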

> Lucene already has parallel construction: just index with multiple threads.

@rmuir, see the response above regarding the IO cost of merging a large and a small
segment.
Multiple indexing threads don't achieve the same outcome, because they force you to
create multiple segments, which means many small graphs. A mutable index allows you
to keep a single, compact, and efficient graph at all times without the need
for a merge, which can become disproportionately costly (see the extreme example above
of a 1 billion vector dataset to which one more vector is added).

> I agree with Uwe here, I think these datastructures are just not very
> efficient and not integrated well. With a quick search, you can find
> alternatives for ANN that are more sympathetic to the hardware and use
> sequential access.

Can you share a concrete example of an alternative, with quantitative data
behind it?

