sam-herman commented on issue #15420: URL: https://github.com/apache/lucene/issues/15420#issuecomment-3525220399
> Hi, I strongly disagree with this. We had random access in the early times of Apache Lucene and luckily removed it in Lucene 4.0 for good reasons, most of the details about why were already discussed here.

@uschindler use case 1 above ([https://github.com/datastax/jvector/pull/542](https://github.com/datastax/jvector/pull/542)) shows a clear quantitative horizontal improvement in bandwidth for parallel writes on NVMe and SSDs. Can you share the reasoning for why it was removed from earlier versions? Were these benefits considered back then? It would be great to have a data-driven discussion around those aspects, to weigh the pros and cons beyond idiomatic/syntactic preferences.

> The second use case is an anti-pattern in Lucene and breaks the whole transactional behaviour, write once is a must and not uncommon.

I think we can still preserve the transactional contract; however, the underlying implementation will be different. The idea is to be able to make changes in place within indices to reduce the IO footprint. In use case 2 (https://github.com/opensearch-project/opensearch-jvector/issues/169), for example, the performance boost from incremental insertion into a graph, as opposed to merges, is substantial: https://github.com/opensearch-project/opensearch-jvector/blob/main/merge_times_comparison.png?raw=true

The issue, however, is that even when choosing a leading segment and optimizing merges, there is still an IO penalty for the merge. Imagine you have a large segment of 1 billion vectors and you want to add 1 more. A force merge in that scenario will require you to persist about 6 TB of data back to disk. Mutable indices can alleviate this problem by reducing the IO footprint to the minimal delta required to add to the graph.

> Lucene already has parallel construction: just index with multiple threads.

@rmuir see the response above regarding the IO cost of merging a large and a small segment.
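
To put rough numbers on the 1 billion vector example, here is a back-of-the-envelope sketch. The dimension count (1536) and the 4-byte float element size are assumptions, picked only because they land near the ~6 TB figure; they are not measurements from the linked issue. The class and method names are made up for illustration. The sketch counts raw vector bytes only and ignores graph neighbor lists and other index structures.

```java
// Hypothetical helper for illustration only; not part of Lucene or jvector.
final class MergeIoEstimate {
    // Assumed values (not stated in the issue): 1536-dimensional float32 vectors.
    static final long DIMENSIONS = 1536;
    static final long BYTES_PER_FLOAT = 4;

    /** Raw vector bytes needed to store the given number of vectors. */
    static long vectorBytes(long numVectors) {
        return numVectors * DIMENSIONS * BYTES_PER_FLOAT;
    }

    public static void main(String[] args) {
        long existingVectors = 1_000_000_000L;
        long addedVectors = 1;

        // A merge rewrites every vector into the new segment.
        long fullRewriteBytes = vectorBytes(existingVectors + addedVectors);
        // An in-place append would, at minimum, write only the new vector.
        long deltaBytes = vectorBytes(addedVectors);

        System.out.println("full rewrite bytes: " + fullRewriteBytes);
        System.out.println("delta bytes:        " + deltaBytes);
        System.out.println("ratio:              " + (fullRewriteBytes / deltaBytes));
    }
}
```

Under these assumptions the full rewrite is a little over 6.1 TB while the new vector itself is about 6 KB. The delta is a lower bound: an in-place insert would also have to update the neighbor lists of some existing nodes, so the real saving would be smaller than this ratio suggests.
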
Multiple indexing threads don't achieve the same outcome, because they force you to create multiple segments, which means many small graphs. A mutable index allows you to keep a single, most compact and efficient graph at all times, without the need for a merge, which can become disproportionately costly (see the extreme example above of a 1 billion vector dataset to which one more vector is added).

> I agree with Uwe here, I think these datastructures are just not very efficient and not integrated well. With a quick search, you can find alternatives for ANN that are more sympathetic to the hardware and use sequential access.

Can you share a concrete example of an alternative with quantitative data behind it?
