sam-herman commented on issue #15420:
URL: https://github.com/apache/lucene/issues/15420#issuecomment-3525220399

   > Hi, I strongly disagree with this. We had random access in the early times 
of Apache Lucene and luckily removed it in Lucene 4.0 for good reasons, most of 
the details about why were already discussed here.
   
   @uschindler use case 1 above, 
[https://github.com/datastax/jvector/pull/542](https://github.com/datastax/jvector/pull/542),
 shows a clear, quantified horizontal improvement in bandwidth for parallel 
writes on NVMe and SSDs.
   Can you share the reasoning for why it was removed from earlier versions? Were 
these benefits considered back then? It would be great to have a data-driven 
discussion around those aspects to weigh the pros and cons outside of 
idiomatic/syntactic preferences.
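
   To make the access pattern concrete, here is a minimal sketch of parallel positional writes into a single file. It is not the jvector or Lucene API; the names `CHUNK_SIZE`, `NUM_CHUNKS`, `write_chunk` and `parallel_write` are hypothetical, and it assumes a POSIX system where Python's `os.pwrite` is available. It only illustrates the pattern and does not measure bandwidth.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4096  # hypothetical chunk size in bytes
NUM_CHUNKS = 8     # hypothetical number of parallel writers


def write_chunk(fd: int, index: int) -> int:
    """Write one chunk at its own offset and return the bytes written."""
    payload = bytes([index % 256]) * CHUNK_SIZE
    # os.pwrite takes an explicit offset, so threads never share a file cursor.
    return os.pwrite(fd, payload, index * CHUNK_SIZE)


def parallel_write(path: str) -> int:
    """Fill a file from several threads, each writing a disjoint region."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=NUM_CHUNKS) as pool:
            written = sum(pool.map(lambda i: write_chunk(fd, i), range(NUM_CHUNKS)))
        os.fsync(fd)  # flush to stable storage before reporting success
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        total = parallel_write(os.path.join(tmp, "segment.bin"))
        print(f"wrote {total} bytes from {NUM_CHUNKS} threads")
```

   An append-only output cannot express this pattern, because every writer has to go through the same file position.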
   
   
   > The second use case is an anti-pattern in Lucene and breaks the whole 
transactional behaviour, write once is a must and not uncommon.
   
    I think we can still preserve the transactional contract, although the 
underlying implementation would be different. The idea is to be able to make 
changes in place within indices to reduce the IO footprint.
   Take use case 2 
(https://github.com/opensearch-project/opensearch-jvector/issues/169), for 
example:
   
   The performance boost from incremental insertion into a graph, as opposed to 
merges, is substantial:
   
https://github.com/opensearch-project/opensearch-jvector/blob/main/merge_times_comparison.png?raw=true
   
   The issue, however, is that even when choosing a leading segment and 
optimizing merges, there is still an IO penalty for the merge.
   Imagine you have a large segment of 1 billion vectors and you want to 
add 1 more. A force merge in that scenario will require you to persist about 6 TB 
of data back to disk.
   Mutable indices can alleviate this problem by reducing the IO footprint 
to only the minimal delta required to add to the graph.
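
   As a back-of-envelope sketch of that arithmetic: the dimension (1536, float32) and the maximum graph degree (32) below are my assumptions, chosen only because 1 billion such vectors come to roughly 6.1 TB, in line with the figure above. The in-place estimate is a hypothetical model (the new vector, its neighbor list, and the neighbor lists of the nodes it links to), not a measurement, and the merge estimate ignores the graph itself.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_NODE_ID = 4  # assumption: neighbors stored as 32-bit ids


def vector_bytes(num_vectors: int, dimension: int) -> int:
    """Raw size of float32 vectors, ignoring graph and metadata overhead."""
    return num_vectors * dimension * BYTES_PER_FLOAT32


def merge_write_bytes(existing: int, added: int, dimension: int) -> int:
    """A merge rewrites every vector, old and new, into a fresh segment."""
    return vector_bytes(existing + added, dimension)


def in_place_write_bytes(added: int, dimension: int, max_degree: int) -> int:
    """Hypothetical delta: each new vector, its own neighbor list, and the
    rewritten neighbor lists of up to max_degree nodes it connects to."""
    neighbor_list = max_degree * BYTES_PER_NODE_ID
    per_vector = dimension * BYTES_PER_FLOAT32 + (1 + max_degree) * neighbor_list
    return added * per_vector


if __name__ == "__main__":
    merge = merge_write_bytes(10**9, 1, dimension=1536)
    delta = in_place_write_bytes(1, dimension=1536, max_degree=32)
    print(f"merge rewrite : {merge / 1e12:.2f} TB")
    print(f"in-place delta: {delta / 1e3:.1f} KB")
```

   Under these assumptions, the merge writes terabytes while the in-place update writes kilobytes.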
   
   
   > Lucene already has parallel construction: just index with multiple threads.
   
   @rmuir see the response above regarding the IO cost of merging large and small 
segments.
   Multiple indexing threads don't achieve the same outcome, because they force you to 
create multiple segments, which means many small graphs. A mutable index allows you 
to keep a single, most compact and efficient graph at all times without the need 
for a merge, which can become disproportionately costly (see the extreme example above 
of a 1 billion vector dataset when one additional vector is added).
   
   > I agree with Uwe here, I think these datastructures are just not very 
efficient and not integrated well. With a quick search, you can find 
alternatives for ANN that are more sympathetic to the hardware and use 
sequential access.
   
   Can you share a concrete example of an alternative with quantitative data 
behind it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

