Ugh, sorry for misspelling your name. I blame the phone!

On Sun, Mar 28, 2021, 6:50 AM Michael Sokolov <msoko...@gmail.com> wrote:
> Hi Dimitry, I worked initially from the papers cited in LUCENE-9004,
> which I think is also what Tomoko was doing. Later I did refer to nmslib
> too.
>
> On Sat, Mar 27, 2021, 6:01 AM Dmitry Kan <dmitry.luc...@gmail.com> wrote:
>
>> Michael,
>>
>> I have developed some interest in this area and have been doing a
>> comparative study of different KNN implementations and blogging about it.
>>
>> Did you use nmslib for the HNSW implementation, or something else?
>>
>> On Tue, 16 Mar 2021 at 22:47, Michael Sokolov <msoko...@gmail.com> wrote:
>>
>>> Yeah, HNSW is problematic in a few ways: (1) merging is costly, due to
>>> the need to completely recreate the graph; (2) searching across a
>>> segmented index sacrifices much of the performance benefit of HNSW,
>>> since the cost of searching an HNSW graph scales ~logarithmically with
>>> the size of the graph, so splitting the index into multiple graphs and
>>> then merge-sorting the results is pretty expensive. I guess the
>>> random-access / forward-scan dynamic is another problematic area.
>>>
>>> On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcm...@gmail.com> wrote:
>>> >
>>> > Maybe that is so, but we should factor in everything, such as
>>> > large-scale indexing and not requiring the whole data set to be in
>>> > RAM. Hey, it's Lucene!
>>> >
>>> > Because HNSW has dominated the nightly benchmarks, I have been
>>> > digging through stack traces and trying to figure out ways to make it
>>> > work efficiently, and I'm not sure what to do.
>>> > Merging especially is painful: it seems to cause a storm of page
>>> > faults/random accesses due to how it works, and I don't know yet how
>>> > to make it better.
>>> > It seems to rebuild the entire graph, spraying random accesses across
>>> > a "slow wrapper" that binary-searches each sub-reader on every access.
>>> > I don't see any way to even amortize the pain with some kind of bulk
>>> > merge trick.
>>> >
>>> > So if we find algorithms that scale better, I think we should lend a
>>> > preference towards them.
>>> > For example, algorithms that allow per-segment/sequential indexing
>>> > and merging.
>>> >
>>> > On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msoko...@gmail.com>
>>> > wrote:
>>> >>
>>> >> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
>>> >> (approximate NN) algorithms. When we started this effort, HNSW was
>>> >> at the top of the heap in most of the benchmarks.
>>> >>
>>> >> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com>
>>> >> wrote:
>>> >> >
>>> >> > Where are the alternative algorithms that work on sequential
>>> >> > iterators and don't need random access?
>>> >> >
>>> >> > It seems like these should be the ones we initially add to Lucene,
>>> >> > and HNSW should be put aside for now. (Is it a toy, or can we do
>>> >> > it without jazillions of random accesses?)
>>> >> >
>>> >> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov
>>> >> > <msoko...@gmail.com> wrote:
>>> >> >>
>>> >> >> There's also some good discussion on
>>> >> >> https://issues.apache.org/jira/browse/LUCENE-9583 about the
>>> >> >> random-access vs. iterator pattern that never got fully resolved.
>>> >> >> We said we would revisit it after KNN (LUCENE-9004) landed, and
>>> >> >> now it has. The usage of random access is pretty well established
>>> >> >> there; maybe we should abandon the iterator API, since it is
>>> >> >> redundant (you can always iterate over a random-access API if you
>>> >> >> know the size)?
>>> >> >>
>>> >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov
>>> >> >> <msoko...@gmail.com> wrote:
>>> >> >> >
>>> >> >> > Also, Tomoko, re: LUCENE-9322, did it succeed? I guess we won't
>>> >> >> > know for sure unless someone revives
>>> >> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something
>>> >> >> > like that.
>>> >> >> >
>>> >> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov
>>> >> >> > <msoko...@gmail.com> wrote:
>>> >> >> > >
>>> >> >> > > Consistent plural naming makes sense to me.
>>> >> >> > > I think it ended up singular because I am biased to avoid
>>> >> >> > > plural names unless there is a useful distinction to be made.
>>> >> >> > > But consistency should trump my predilections.
>>> >> >> > >
>>> >> >> > > I think the reason we have search() on VectorValues is that
>>> >> >> > > we have LeafReader.getVectorValues() (by analogy to the
>>> >> >> > > DocValues iterators), but no way to access the VectorReader.
>>> >> >> > > Do you think we should also have LeafReader.getVectorReader()?
>>> >> >> > > Today it's only on CodecReader.
>>> >> >> > >
>>> >> >> > > Re: SearchStrategy.NONE: the idea is that we support
>>> >> >> > > efficient access to floating-point values. Using
>>> >> >> > > BinaryDocValues for this will always require an additional
>>> >> >> > > decoding step. I can see that the naming is confusing there.
>>> >> >> > > The intent is that you index the vector values, but no
>>> >> >> > > additional indexing data structure. Also, the reason HNSW is
>>> >> >> > > mentioned in these SearchStrategy enum constants is to make
>>> >> >> > > room for other vector indexing approaches, like LSH. There
>>> >> >> > > was a lot of discussion about wanting an API that allowed for
>>> >> >> > > experimenting with other techniques for indexing and
>>> >> >> > > searching vector values.
>>> >> >> > >
>>> >> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and
>>> >> >> > > DocValues), but I think the situation is more akin to Points,
>>> >> >> > > where we have the options on IndexableField. The metadata we
>>> >> >> > > store there (dimension and score function) doesn't really
>>> >> >> > > result in different formats, i.e., code paths for indexing
>>> >> >> > > and storage; it is more like parameters to the format, in my
>>> >> >> > > mind. Perhaps the situation will look different when we get
>>> >> >> > > our second vector indexing strategy (like LSH).
>>> >> >> > >
>>> >> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
>>> >> >> > > <tomoko.uchida.1...@gmail.com> wrote:
>>> >> >> > > >
>>> >> >> > > > > Should we rename VectorFormat to VectorsFormat? This
>>> >> >> > > > > would be more consistent with other file formats that
>>> >> >> > > > > use the plural, like PostingsFormat, DocValuesFormat,
>>> >> >> > > > > TermVectorsFormat, etc.
>>> >> >> > > >
>>> >> >> > > > +1 for using the plural form for consistency. If we
>>> >> >> > > > reconsider the names, how about VectorValuesFormat, so that
>>> >> >> > > > it follows the naming convention for XXXValues?
>>> >> >> > > >
>>> >> >> > > > DocValuesFormat / DocValues
>>> >> >> > > > PointValuesFormat / PointValues
>>> >> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat
>>> >> >> > > > / VectorValues)
>>> >> >> > > >
>>> >> >> > > > > Should SearchStrategy constants avoid explicit
>>> >> >> > > > > references to HNSW?
>>> >> >> > > >
>>> >> >> > > > Also +1 for decoupling HNSW-specific implementations from
>>> >> >> > > > general vectors, though I am not fully sure whether we can
>>> >> >> > > > strictly separate the similarity metrics and search
>>> >> >> > > > algorithms for vectors.
>>> >> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago;
>>> >> >> > > > did it achieve its goal? I haven't followed the issue in
>>> >> >> > > > months because of my laziness...
>>> >> >> > > >
>>> >> >> > > > Thanks,
>>> >> >> > > > Tomoko
>>> >> >> > > >
>>> >> >> > > > On Tue, Mar 16, 2021 at 19:32 Adrien Grand
>>> >> >> > > > <jpou...@gmail.com> wrote:
>>> >> >> > > >>
>>> >> >> > > >> Hello,
>>> >> >> > > >>
>>> >> >> > > >> I've tried to catch up on the vector API, and I have the
>>> >> >> > > >> following questions. I've tried to read through the
>>> >> >> > > >> discussions on JIRA first in case anything had already
>>> >> >> > > >> been covered, but it's possible I missed some relevant
>>> >> >> > > >> ones.
>>> >> >> > > >>
>>> >> >> > > >> Should VectorValues#search be on VectorReader instead? It
>>> >> >> > > >> felt a bit odd to me to have the search logic on the
>>> >> >> > > >> iterator.
>>> >> >> > > >>
>>> >> >> > > >> Do we need SearchStrategy.NONE?
>>> >> >> > > >> The documentation suggests that it allows storing vectors
>>> >> >> > > >> but that NN search won't be supported. This looks like a
>>> >> >> > > >> use case for binary doc values to me? It also slightly
>>> >> >> > > >> caught me by surprise due to the inconsistency with
>>> >> >> > > >> IndexOptions.NONE, which means "do not index this field"
>>> >> >> > > >> (and likewise for DocValuesType.NONE), so I first assumed
>>> >> >> > > >> that SearchStrategy.NONE also meant "do not index this
>>> >> >> > > >> field as a vector".
>>> >> >> > > >>
>>> >> >> > > >> While postings and doc-values formats allow per-field
>>> >> >> > > >> configuration via PerFieldPostingsFormat/
>>> >> >> > > >> PerFieldDocValuesFormat, vectors use a different
>>> >> >> > > >> mechanism, where VectorField#createHnswType sets
>>> >> >> > > >> attributes on the field type that the vectors writer then
>>> >> >> > > >> reads. Should we have a PerFieldVectorsFormat instead and
>>> >> >> > > >> configure these options via the vectors format?
>>> >> >> > > >>
>>> >> >> > > >> Should SearchStrategy constants avoid explicit references
>>> >> >> > > >> to HNSW? The rest of the API seems to try to be agnostic
>>> >> >> > > >> of the way that NN search is implemented. Could we make
>>> >> >> > > >> SearchStrategy only about the similarity metric that is
>>> >> >> > > >> used for vectors? This particular point seems to have been
>>> >> >> > > >> discussed on LUCENE-9322, but I couldn't find the
>>> >> >> > > >> conclusion.
>>> >> >> > > >>
>>> >> >> > > >> Should we rename VectorFormat to VectorsFormat? This
>>> >> >> > > >> would be more consistent with other file formats that use
>>> >> >> > > >> the plural, like PostingsFormat, DocValuesFormat,
>>> >> >> > > >> TermVectorsFormat, etc.
>>> >> >> > > >>
>>> >> >> > > >> --
>>> >> >> > > >> Adrien
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>
>> --
>> Dmitry Kan
>> Luke Toolbox: http://github.com/DmitryKey/luke
>> Blog: http://dmitrykan.blogspot.com
>> Twitter: http://twitter.com/dmitrykan
>> SemanticAnalyzer: www.semanticanalyzer.info
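Michael's point about segmented search above — each segment's HNSW graph must be searched for the full top-k, and the per-segment hit lists then merge-sorted into one global top-k — can be sketched as follows. This is a standalone illustration with hypothetical names (`Hit`, `mergeTopK`), not Lucene's actual collector or graph-searcher API:

```java
import java.util.*;

// Sketch: why a segmented index pays per-segment search cost. Every segment
// must produce its own top-k, and the results are merged afterwards.
public class TopKMerge {
    // A scored hit: document id plus similarity score (hypothetical holder).
    static final class Hit {
        final int doc; final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Merge per-segment top-k lists into a single global top-k using a
    // min-heap of size k, the way segmented search must combine results.
    static List<Hit> mergeTopK(List<List<Hit>> perSegment, int k) {
        PriorityQueue<Hit> pq = new PriorityQueue<>(
            Comparator.comparingDouble((Hit h) -> h.score)); // weakest on top
        for (List<Hit> segmentHits : perSegment) {
            for (Hit h : segmentHits) {
                pq.offer(h);
                if (pq.size() > k) pq.poll(); // evict the weakest hit
            }
        }
        List<Hit> out = new ArrayList<>(pq);
        out.sort((a, b) -> Float.compare(b.score, a.score)); // best first
        return out;
    }

    public static void main(String[] args) {
        // Two "segments", each already searched for its local top hits.
        List<List<Hit>> segments = Arrays.asList(
            Arrays.asList(new Hit(1, 0.9f), new Hit(2, 0.5f)),
            Arrays.asList(new Hit(10, 0.8f), new Hit(11, 0.1f)));
        List<Hit> top2 = mergeTopK(segments, 2);
        System.out.println(top2.get(0).doc + " " + top2.get(1).doc); // 1 10
    }
}
```

The merge itself is cheap; the expense Michael describes is that each graph must be searched for the full top-k regardless of segment size, which is where the ~logarithmic scaling advantage of one large graph is lost.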
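Michael's remark that "you can always iterate over a random-access API if you know the size" can be sketched with a minimal adapter. The interface and method names here (`RandomAccessVectors`, `iterate`) are hypothetical, not the VectorValues API under discussion:

```java
import java.util.Iterator;

// Sketch: a random-access view (size + get-by-ordinal) trivially supports
// forward iteration, which is why the iterator API is arguably redundant.
public class RandomAccessToIterator {
    // Hypothetical minimal random-access view of per-document vectors.
    interface RandomAccessVectors {
        int size();
        float[] vectorValue(int ord);
    }

    // Adapt any random-access view to a forward iterator.
    static Iterator<float[]> iterate(RandomAccessVectors values) {
        return new Iterator<float[]>() {
            private int ord = 0;
            public boolean hasNext() { return ord < values.size(); }
            public float[] next() { return values.vectorValue(ord++); }
        };
    }

    public static void main(String[] args) {
        float[][] data = {{1f, 0f}, {0f, 1f}};
        RandomAccessVectors v = new RandomAccessVectors() {
            public int size() { return data.length; }
            public float[] vectorValue(int ord) { return data[ord]; }
        };
        int count = 0;
        for (Iterator<float[]> it = iterate(v); it.hasNext(); it.next()) count++;
        System.out.println(count); // 2
    }
}
```

The reverse direction does not hold: a forward-only iterator cannot serve the random ordinal lookups an HNSW graph search performs, which is the asymmetry behind Robert's objection.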
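Michael's argument for SearchStrategy.NONE over BinaryDocValues — that storing float vectors as opaque bytes "will always require an additional decoding step" — can be made concrete with a round-trip sketch. The encoding below is an assumption for illustration (little-endian IEEE 754 floats), not what any Lucene codec actually writes:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Sketch: if vectors lived in binary doc values, every read would pay a
// bytes -> float[] decode like the one below; a dedicated vector format
// can hand back floats directly.
public class BinaryVectorDecode {
    // Pack a float[] into bytes (byte order is an assumption here).
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) buf.putFloat(v);
        return buf.array();
    }

    // The decoding step paid on every access in the BinaryDocValues design.
    static float[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) vector[i] = buf.getFloat();
        return vector;
    }

    public static void main(String[] args) {
        float[] v = {0.25f, -1.5f, 3f};
        float[] roundTrip = decode(encode(v));
        System.out.println(Arrays.equals(v, roundTrip)); // true
    }
}
```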