Ugh, sorry for misspelling your name. I blame the phone!

On Sun, Mar 28, 2021, 6:50 AM Michael Sokolov <msoko...@gmail.com> wrote:
> Hi Dimitry, I worked initially from the papers cited in LUCENE-9004,
> which I think is also what Tomoko was doing. Later I did refer to nmslib
> too.
>
> On Sat, Mar 27, 2021, 6:01 AM Dmitry Kan <dmitry.luc...@gmail.com> wrote:
>
>> Michael,
>>
>> I have developed some interest in this area and have been doing a
>> comparative study of different KNN implementations and blogging about it.
>>
>> Did you use nmslib for the HNSW implementation, or something else?
>>
>> On Tue, 16 Mar 2021 at 22:47, Michael Sokolov <msoko...@gmail.com> wrote:
>>
>>> Yeah, HNSW is problematic in a few ways: (1) merging is costly, due to
>>> the need to completely recreate the graph; (2) searching across a
>>> segmented index sacrifices much of the performance benefit of HNSW,
>>> since the cost of searching an HNSW graph scales ~logarithmically with
>>> the size of the graph, so splitting the index into multiple graphs and
>>> then merge-sorting the results is pretty expensive. I guess the
>>> random-access / forward-scan dynamic is another problematic area.
>>>
>>> On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcm...@gmail.com> wrote:
>>> >
>>> > Maybe that is so, but we should factor in everything, such as
>>> > large-scale indexing and not requiring the whole data set to be in
>>> > RAM. Hey, it's Lucene!
>>> >
>>> > Because HNSW has dominated the nightly benchmarks, I have been
>>> > digging through stack traces and trying to figure out ways to make it
>>> > work efficiently, and I'm not sure what to do.
>>> > Merging especially is painful: it seems to cause a storm of page
>>> > faults/random accesses due to how it works, and I don't know yet how
>>> > to make it better.
>>> > It seems to rebuild the entire graph, spraying random accesses across
>>> > a "slow wrapper" that binary-searches each sub-reader on every access.
>>> > I don't see any way to even amortize the pain with some kind of bulk
>>> > merge trick.
>>> >
>>> > So if we find algorithms that scale better, I think we should lend a
>>> > preference towards them.
>>> > For example, algorithms that allow per-segment/sequential indexing
>>> > and merging.
>>> >
>>> > On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msoko...@gmail.com>
>>> > wrote:
>>> >>
>>> >> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
>>> >> (approximate NN) algorithms. When we started this effort, HNSW was
>>> >> at the top of the heap in most of the benchmarks.
>>> >>
>>> >> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com>
>>> >> wrote:
>>> >> >
>>> >> > Where are the alternative algorithms that work on sequential
>>> >> > iterators and don't need random access?
>>> >> >
>>> >> > It seems like these should be the ones we initially add to Lucene,
>>> >> > and HNSW should be put aside for now. (Is it a toy, or can we do
>>> >> > it without jazillions of random accesses?)
>>> >> >
>>> >> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov
>>> >> > <msoko...@gmail.com> wrote:
>>> >> >>
>>> >> >> There's also some good discussion on
>>> >> >> https://issues.apache.org/jira/browse/LUCENE-9583 about the
>>> >> >> random-access vs. iterator pattern that never got fully resolved.
>>> >> >> We said we would revisit it after KNN (LUCENE-9004) landed, and
>>> >> >> now it has. The usage of random access is pretty well established
>>> >> >> there; maybe we should abandon the iterator API, since it is
>>> >> >> redundant (you can always iterate over a random-access API if you
>>> >> >> know the size)?
>>> >> >>
>>> >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov
>>> >> >> <msoko...@gmail.com> wrote:
>>> >> >> >
>>> >> >> > Also, Tomoko, re: LUCENE-9322, did it succeed? I guess we won't
>>> >> >> > know for sure unless someone revives
>>> >> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something
>>> >> >> > like that.
>>> >> >> >
>>> >> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov
>>> >> >> > <msoko...@gmail.com> wrote:
>>> >> >> > >
>>> >> >> > > Consistent plural naming makes sense to me.
>>> >> >> > > I think it ended up singular because I am biased to avoid
>>> >> >> > > plural names unless there is a useful distinction to be made.
>>> >> >> > > But consistency should trump my predilections.
>>> >> >> > >
>>> >> >> > > I think the reason we have search() on VectorValues is that
>>> >> >> > > we have LeafReader.getVectorValues() (by analogy to the
>>> >> >> > > DocValues iterators), but no way to access the VectorReader.
>>> >> >> > > Do you think we should also have LeafReader.getVectorReader()?
>>> >> >> > > Today it's only on CodecReader.
>>> >> >> > >
>>> >> >> > > Re: SearchStrategy.NONE: the idea is that we support
>>> >> >> > > efficient access to floating-point values. Using
>>> >> >> > > BinaryDocValues for this will always require an additional
>>> >> >> > > decoding step. I can see that the naming is confusing there.
>>> >> >> > > The intent is that you index the vector values, but no
>>> >> >> > > additional indexing data structure. Also, the reason HNSW is
>>> >> >> > > mentioned in these SearchStrategy enum constants is to make
>>> >> >> > > room for other vector indexing approaches, like LSH. There
>>> >> >> > > was a lot of discussion about wanting an API that allowed for
>>> >> >> > > experimenting with other techniques for indexing and
>>> >> >> > > searching vector values.
>>> >> >> > >
>>> >> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and
>>> >> >> > > DocValues), but I think the situation is more akin to Points,
>>> >> >> > > where we have the options on IndexableField. The metadata we
>>> >> >> > > store there (dimension and score function) doesn't really
>>> >> >> > > result in different formats, i.e., code paths for indexing
>>> >> >> > > and storage; it is more like parameters to the format, in my
>>> >> >> > > mind. Perhaps the situation will look different when we get
>>> >> >> > > our second vector indexing strategy (like LSH).
>>> >> >> > >
>>> >> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
>>> >> >> > > <tomoko.uchida.1...@gmail.com> wrote:
>>> >> >> > > >
>>> >> >> > > > > Should we rename VectorFormat to VectorsFormat? This
>>> >> >> > > > > would be more consistent with other file formats that
>>> >> >> > > > > use the plural, like PostingsFormat, DocValuesFormat,
>>> >> >> > > > > TermVectorsFormat, etc.
>>> >> >> > > >
>>> >> >> > > > +1 for using the plural form for consistency. If we
>>> >> >> > > > reconsider the names, how about VectorValuesFormat, so that
>>> >> >> > > > it follows the naming convention for XXXValues?
>>> >> >> > > >
>>> >> >> > > > DocValuesFormat / DocValues
>>> >> >> > > > PointValuesFormat / PointValues
>>> >> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat
>>> >> >> > > > / VectorValues)
>>> >> >> > > >
>>> >> >> > > > > Should SearchStrategy constants avoid explicit
>>> >> >> > > > > references to HNSW?
>>> >> >> > > >
>>> >> >> > > > Also +1 for decoupling HNSW-specific implementations from
>>> >> >> > > > general vectors, though I am not fully sure whether we can
>>> >> >> > > > strictly separate the similarity metrics and search
>>> >> >> > > > algorithms for vectors.
>>> >> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago;
>>> >> >> > > > did it achieve its goal? I haven't followed the issue in
>>> >> >> > > > months because of my laziness...
>>> >> >> > > >
>>> >> >> > > > Thanks,
>>> >> >> > > > Tomoko
>>> >> >> > > >
>>> >> >> > > > On Tue, Mar 16, 2021 at 19:32 Adrien Grand
>>> >> >> > > > <jpou...@gmail.com> wrote:
>>> >> >> > > >>
>>> >> >> > > >> Hello,
>>> >> >> > > >>
>>> >> >> > > >> I've tried to catch up on the vector API, and I have the
>>> >> >> > > >> following questions. I've tried to read through the
>>> >> >> > > >> discussions on JIRA first in case anything had already
>>> >> >> > > >> been covered, but it's possible I missed some relevant
>>> >> >> > > >> ones.
>>> >> >> > > >>
>>> >> >> > > >> Should VectorValues#search be on VectorReader instead? It
>>> >> >> > > >> felt a bit odd to me to have the search logic on the
>>> >> >> > > >> iterator.
>>> >> >> > > >>
>>> >> >> > > >> Do we need SearchStrategy.NONE?
>>> >> >> > > >> The documentation suggests that it allows storing vectors
>>> >> >> > > >> but that NN search won't be supported. This looks like a
>>> >> >> > > >> use case for binary doc values to me? It also slightly
>>> >> >> > > >> caught me by surprise due to the inconsistency with
>>> >> >> > > >> IndexOptions.NONE, which means "do not index this field"
>>> >> >> > > >> (and likewise for DocValuesType.NONE), so I first assumed
>>> >> >> > > >> that SearchStrategy.NONE also meant "do not index this
>>> >> >> > > >> field as a vector".
>>> >> >> > > >>
>>> >> >> > > >> While postings and doc-values formats allow per-field
>>> >> >> > > >> configuration via PerFieldPostingsFormat/
>>> >> >> > > >> PerFieldDocValuesFormat, vectors use a different
>>> >> >> > > >> mechanism, where VectorField#createHnswType sets
>>> >> >> > > >> attributes on the field type that the vectors writer then
>>> >> >> > > >> reads. Should we have a PerFieldVectorsFormat instead and
>>> >> >> > > >> configure these options via the vectors format?
>>> >> >> > > >>
>>> >> >> > > >> Should SearchStrategy constants avoid explicit references
>>> >> >> > > >> to HNSW? The rest of the API seems to try to be agnostic
>>> >> >> > > >> of the way that NN search is implemented. Could we make
>>> >> >> > > >> SearchStrategy only about the similarity metric that is
>>> >> >> > > >> used for vectors? This particular point seems to have been
>>> >> >> > > >> discussed on LUCENE-9322, but I couldn't find the
>>> >> >> > > >> conclusion.
>>> >> >> > > >>
>>> >> >> > > >> Should we rename VectorFormat to VectorsFormat? This
>>> >> >> > > >> would be more consistent with other file formats that use
>>> >> >> > > >> the plural, like PostingsFormat, DocValuesFormat,
>>> >> >> > > >> TermVectorsFormat, etc.
>>> >> >> > > >>
>>> >> >> > > >> --
>>> >> >> > > >> Adrien
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>
>> --
>> Dmitry Kan
>> Luke Toolbox: http://github.com/DmitryKey/luke
>> Blog: http://dmitrykan.blogspot.com
>> Twitter: http://twitter.com/dmitrykan
>> SemanticAnalyzer: www.semanticanalyzer.info
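Michael's point about segmented search above — each segment's HNSW graph must be searched for the full top-k, and the per-segment hit lists then merge-sorted into one global top-k — can be sketched as follows. This is a standalone illustration with hypothetical names (`Hit`, `mergeTopK`), not Lucene's actual collector or graph-searcher API:

```java
import java.util.*;

// Sketch: why a segmented index pays per-segment search cost. Every segment
// must produce its own top-k, and the results are merged afterwards.
public class TopKMerge {
    // A scored hit: document id plus similarity score (hypothetical holder).
    static final class Hit {
        final int doc; final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Merge per-segment top-k lists into a single global top-k using a
    // min-heap of size k, the way segmented search must combine results.
    static List<Hit> mergeTopK(List<List<Hit>> perSegment, int k) {
        PriorityQueue<Hit> pq = new PriorityQueue<>(
            Comparator.comparingDouble((Hit h) -> h.score)); // weakest on top
        for (List<Hit> segmentHits : perSegment) {
            for (Hit h : segmentHits) {
                pq.offer(h);
                if (pq.size() > k) pq.poll(); // evict the weakest hit
            }
        }
        List<Hit> out = new ArrayList<>(pq);
        out.sort((a, b) -> Float.compare(b.score, a.score)); // best first
        return out;
    }

    public static void main(String[] args) {
        // Two "segments", each already searched for its local top hits.
        List<List<Hit>> segments = Arrays.asList(
            Arrays.asList(new Hit(1, 0.9f), new Hit(2, 0.5f)),
            Arrays.asList(new Hit(10, 0.8f), new Hit(11, 0.1f)));
        List<Hit> top2 = mergeTopK(segments, 2);
        System.out.println(top2.get(0).doc + " " + top2.get(1).doc); // 1 10
    }
}
```

The merge itself is cheap; the expense Michael describes is that each graph must be searched for the full top-k regardless of segment size, which is where the ~logarithmic scaling advantage of one large graph is lost.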
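Michael's remark that "you can always iterate over a random-access API if you know the size" can be sketched with a minimal adapter. The interface and method names here (`RandomAccessVectors`, `iterate`) are hypothetical, not the VectorValues API under discussion:

```java
import java.util.Iterator;

// Sketch: a random-access view (size + get-by-ordinal) trivially supports
// forward iteration, which is why the iterator API is arguably redundant.
public class RandomAccessToIterator {
    // Hypothetical minimal random-access view of per-document vectors.
    interface RandomAccessVectors {
        int size();
        float[] vectorValue(int ord);
    }

    // Adapt any random-access view to a forward iterator.
    static Iterator<float[]> iterate(RandomAccessVectors values) {
        return new Iterator<float[]>() {
            private int ord = 0;
            public boolean hasNext() { return ord < values.size(); }
            public float[] next() { return values.vectorValue(ord++); }
        };
    }

    public static void main(String[] args) {
        float[][] data = {{1f, 0f}, {0f, 1f}};
        RandomAccessVectors v = new RandomAccessVectors() {
            public int size() { return data.length; }
            public float[] vectorValue(int ord) { return data[ord]; }
        };
        int count = 0;
        for (Iterator<float[]> it = iterate(v); it.hasNext(); it.next()) count++;
        System.out.println(count); // 2
    }
}
```

The reverse direction does not hold: a forward-only iterator cannot serve the random ordinal lookups an HNSW graph search performs, which is the asymmetry behind Robert's objection.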
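Michael's argument for SearchStrategy.NONE over BinaryDocValues — that storing float vectors as opaque bytes "will always require an additional decoding step" — can be made concrete with a round-trip sketch. The encoding below is an assumption for illustration (little-endian IEEE 754 floats), not what any Lucene codec actually writes:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Sketch: if vectors lived in binary doc values, every read would pay a
// bytes -> float[] decode like the one below; a dedicated vector format
// can hand back floats directly.
public class BinaryVectorDecode {
    // Pack a float[] into bytes (byte order is an assumption here).
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) buf.putFloat(v);
        return buf.array();
    }

    // The decoding step paid on every access in the BinaryDocValues design.
    static float[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) vector[i] = buf.getFloat();
        return vector;
    }

    public static void main(String[] args) {
        float[] v = {0.25f, -1.5f, 3f};
        float[] roundTrip = decode(encode(v));
        System.out.println(Arrays.equals(v, roundTrip)); // true
    }
}
```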