Hi Dimitry, I worked initially from the papers cited in LUCENE-9004, which I think is also what Tomoko was doing. Later I did refer to nmslib too.
On Sat, Mar 27, 2021, 6:01 AM Dmitry Kan <dmitry.luc...@gmail.com> wrote: > Michael, > > I got some interest in this area and have been doing comparative study of > different KNN implementations and blogging about it. > > Did you use nmslib for HNSW implementation or something else? > > On Tue, 16 Mar 2021 at 22:47, Michael Sokolov <msoko...@gmail.com> wrote: > >> Yeah, HNSW is problematic in a few ways: (1) merging is costly due to >> the need to completely recreate the graph. (2) searching across a >> segmented index sacrifices much of the performance benefit of HNSW >> since the cost of searching HNSW graphs scales ~logarithmically with >> the size of the graph, so splitting into multiple graphs and then >> merge sorting results is pretty expensive. I guess the random access / >> scan forward dynamic is another problematic area. >> >> On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcm...@gmail.com> wrote: >> > >> > Maybe that is so, but we should factor in everything: such as large >> scale indexing, not requiring whole data set to be in RAM, etc. Hey, it's >> Lucene! >> > >> > Because HNSW has dominated the nightly benchmarks, I have been digging >> through stacktraces and trying to figure out ways to make it work >> efficiently, and I'm not sure what to do. >> > Especially merge is painful: it seems to cause a storm of page >> faults/random accesses due to how it works, and I don't know yet how to >> make it better. >> > It seems to rebuild the entire graph, spraying random accesses across a >> "slow-wrapper" that binary searches each sub on every access. >> > I don't see any way to even amortize the pain with some kind of bulk >> merge trick. >> > >> > So if we find algorithms that scale better, I think we should lend a >> preference towards them. For example, algorithms that allow >> per-segment/sequential index and merge. >> > >> > On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msoko...@gmail.com> >> wrote: >> >> >> >> ann-benchmarks.com maintains open benchmarks of a bunch of ANN >> >> (approximate NN) algorithms. When we started this effort, HNSW was at >> >> the top of the heap in most of the benchmarks. >> >> >> >> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com> wrote: >> >> > >> >> > Where are the alternative algorithms that work on sequential >> iterators and don't need random access? >> >> > >> >> > Seems like these should be the ones we initially add to lucene, and >> HNSW should be put aside for now? (is it a toy, or can we do it without >> jazillions of random accesses?) >> >> > >> >> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msoko...@gmail.com> >> wrote: >> >> >> >> >> >> There's also some good discussion on >> >> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random >> access >> >> >> vs iterator pattern that never got fully resolved. We said we would >> >> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of >> >> >> random access is pretty well-established there, maybe we should >> >> >> abandon the iterator API since it is redundant (you can always >> iterate >> >> >> over a random access API if you know the size)? >> >> >> >> >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov < >> msoko...@gmail.com> wrote: >> >> >> > >> >> >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't >> know for >> >> >> > sure unless someone revives >> >> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something >> like >> >> >> > that >> >> >> > >> >> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov < >> msoko...@gmail.com> wrote: >> >> >> > > >> >> >> > > Consistent plural naming makes sense to me. I think it ended up >> >> >> > > singular because I am biased to avoid plural names unless there >> is a >> >> >> > > useful distinction to be made. But consistency should trump my >> >> >> > > predilections. >> >> >> > > >> >> >> > > I think the reason we have search() on VectorValues is that we >> have >> >> >> > > LeafReader.getVectorValues() (by analogy to the DocValues >> iterators), >> >> >> > > but no way to access the VectorReader. Do you think we should >> also >> >> >> > > have LeafReader.getVectorReader()? Today it's only on >> CodecReader. >> >> >> > > >> >> >> > > Re: SearchStrategy.NONE; the idea is we support efficient >> access to >> >> >> > > floating point values. Using BinaryDocValues for this will >> always >> >> >> > > require an additional decoding step. I can see that the naming >> is >> >> >> > > confusing there. The intent is that you index the vector >> values, but >> >> >> > > no additional indexing data structure. Also: the reason HNSW is >> >> >> > > mentioned in these SearchStrategy enums is to make room for >> other >> >> >> > > vector indexing approaches, like LSH. There was a lot of >> discussion >> >> >> > > that we wanted an API that allowed for experimenting with other >> >> >> > > techniques for indexing and searching vector values. >> >> >> > > >> >> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and >> DocValues), >> >> >> > > but I think the situation is more akin to Points, where we have >> the >> >> >> > > options on IndexableField. The metadata we store there >> (dimension and >> >> >> > > score function) don't really result in different formats, ie >> code >> >> >> > > paths for indexing and storage; they are more like parameters >> to the >> >> >> > > format, in my mind. Perhaps the situation will look different >> when we >> >> >> > > get our second vector indexing strategy (like LSH). >> >> >> > > >> >> >> > > >> >> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida >> >> >> > > <tomoko.uchida.1...@gmail.com> wrote: >> >> >> > > > >> >> >> > > > > Should we rename VectorFormat to VectorsFormat? This would >> be more consistent with other file formats that use the plural, like >> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? >> >> >> > > > >> >> >> > > > +1 for using plural form for consistency - if we reconsider >> the names, how about VectorValuesFormat so that it follows the naming >> convention for XXXValues? >> >> >> > > > >> >> >> > > > DocValuesFormat / DocValues >> >> >> > > > PointValuesFormat / PointValues >> >> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat / >> VectorValues) >> >> >> > > > >> >> >> > > > > Should SearchStrategy constants avoid explicit references >> to HNSW? >> >> >> > > > >> >> >> > > > Also +1 for decoupling HNSW specific implementations from >> general vectors, though I am not fully sure if we can strictly separate the >> similarity metrics and search algorithms for vectors. >> >> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago, >> does it achieve its goal? I haven't followed the issue in months because of >> my laziness... >> >> >> > > > >> >> >> > > > Thanks, >> >> >> > > > Tomoko >> >> >> > > > >> >> >> > > > >> >> >> > > > 2021年3月16日(火) 19:32 Adrien Grand <jpou...@gmail.com>: >> >> >> > > >> >> >> >> > > >> Hello, >> >> >> > > >> >> >> >> > > >> I've tried to catch up on the vector API and I have the >> following questions. I've tried to read through discussions on JIRA first >> in case it had been covered, but it's possible I missed some relevant ones. >> >> >> > > >> >> >> >> > > >> Should VectorValues#search be on VectorReader instead? It >> felt a bit odd to me to have the search logic on the iterator. >> >> >> > > >> >> >> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that >> it allows storing vectors but that NN search won't be supported. This looks >> like a use-case for binary doc values to me? It also slightly caught me by >> surprise due to the inconsistency with IndexOptions.NONE, which means "do >> not index this field" (and likewise for DocValuesType.NONE), so I first >> assumed that SearchStrategy.NONE also meant "do not index this field as a >> vector". >> >> >> > > >> >> >> >> > > >> While postings and doc-value formats allow per-field >> configuration via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors >> use a different mechanism where VectorField#createHnswType sets attributes >> on the field type that the vectors writer then reads. Should we have a >> PerFieldVectorsFormat instead and configure these options via the vectors >> format? >> >> >> > > >> >> >> >> > > >> Should SearchStrategy constants avoid explicit references to >> HNSW? The rest of the API seems to try to be agnostic of the way that NN >> search is implemented. Could we make SearchStrategy only about the >> similarity metric that is used for vectors? This particular point seems >> discussed on LUCENE-9322 but I couldn't find the conclusion. >> >> >> > > >> >> >> >> > > >> Should we rename VectorFormat to VectorsFormat? This would >> be more consistent with other file formats that use the plural, like >> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.? >> >> >> > > >> >> >> >> > > >> -- >> >> >> > > >> Adrien >> >> >> >> >> >> >> --------------------------------------------------------------------- >> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> > > -- > -- > Dmitry Kan > Luke Toolbox: http://github.com/DmitryKey/luke > Blog: http://dmitrykan.blogspot.com > Twitter: http://twitter.com/dmitrykan > SemanticAnalyzer: www.semanticanalyzer.info >