Re: Questions about the new vector API

Michael Sokolov Sun, 28 Mar 2021 03:50:54 -0700

Hi Dmitry, I worked initially from the papers cited in LUCENE-9004, which
I think is also what Tomoko was doing. Later I did refer to nmslib too.


On Sat, Mar 27, 2021, 6:01 AM Dmitry Kan <dmitry.luc...@gmail.com> wrote:

> Michael,
>
> I have become interested in this area and have been doing a comparative
> study of different KNN implementations and blogging about it.
>
> Did you use nmslib for the HNSW implementation or something else?
>
> On Tue, 16 Mar 2021 at 22:47, Michael Sokolov <msoko...@gmail.com> wrote:
>
>> Yeah, HNSW is problematic in a few ways: (1) merging is costly due to
>> the need to completely recreate the graph. (2) searching across a
>> segmented index sacrifices much of the performance benefit of HNSW
>> since the cost of searching HNSW graphs scales ~logarithmically with
>> the size of the graph, so splitting into multiple graphs and then
>> merge sorting results is pretty expensive. I guess the random access /
>> scan forward dynamic is another problematic area.
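
A minimal sketch of the per-segment cost described above, not Lucene's implementation: each segment's graph is searched on its own, and the per-segment hits are then merged into one global top-k. The names SegmentTopKMerge, Hit, and mergeTopK are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration; none of these names are Lucene APIs. */
final class SegmentTopKMerge {

  /** A (docId, score) pair as one segment's search might return it. */
  static final class Hit {
    final int docId;
    final float score;

    Hit(int docId, float score) {
      this.docId = docId;
      this.score = score;
    }
  }

  /** Merges per-segment hit lists into a global top-k, best score first. */
  static List<Hit> mergeTopK(List<List<Hit>> perSegment, int k) {
    // Min-heap on score: the weakest retained hit sits at the head.
    PriorityQueue<Hit> heap =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    for (List<Hit> hits : perSegment) {
      for (Hit hit : hits) {
        heap.offer(hit);
        if (heap.size() > k) {
          heap.poll(); // drop the current weakest hit
        }
      }
    }
    List<Hit> merged = new ArrayList<>(heap);
    merged.sort((a, b) -> Float.compare(b.score, a.score)); // descending score
    return merged;
  }
}
```

If search cost is roughly logarithmic in graph size, as stated above, then S segments of about N/S vectors each cost on the order of S * log(N/S) graph work instead of log(N), and the merge step comes on top of that.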
>>
>> On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcm...@gmail.com> wrote:
>> >
>> > Maybe that is so, but we should factor in everything: such as large
>> > scale indexing, not requiring whole data set to be in RAM, etc. Hey, it's
>> > Lucene!
>> >
>> > Because HNSW has dominated the nightly benchmarks, I have been digging
>> > through stacktraces and trying to figure out ways to make it work
>> > efficiently, and I'm not sure what to do.
>> > Especially merge is painful: it seems to cause a storm of page
>> > faults/random accesses due to how it works, and I don't know yet how to
>> > make it better.
>> > It seems to rebuild the entire graph, spraying random accesses across a
>> > "slow-wrapper" that binary searches each sub on every access.
>> > I don't see any way to even amortize the pain with some kind of bulk
>> > merge trick.
>> >
>> > So if we find algorithms that scale better, I think we should lend a
>> > preference towards them. For example, algorithms that allow
>> > per-segment/sequential index and merge.
>> >
>> > On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msoko...@gmail.com> wrote:
>> >>
>> >> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
>> >> (approximate NN) algorithms. When we started this effort, HNSW was at
>> >> the top of the heap in most of the benchmarks.
>> >>
>> >> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com> wrote:
>> >> >
>> >> > Where are the alternative algorithms that work on sequential
>> >> > iterators and don't need random access?
>> >> >
>> >> > Seems like these should be the ones we initially add to lucene, and
>> >> > HNSW should be put aside for now? (is it a toy, or can we do it without
>> >> > jazillions of random accesses?)
>> >> >
>> >> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msoko...@gmail.com> wrote:
>> >> >>
>> >> >> There's also some good discussion on
>> >> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random access
>> >> >> vs iterator pattern that never got fully resolved. We said we would
>> >> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of
>> >> >> random access is pretty well-established there, maybe we should
>> >> >> abandon the iterator API since it is redundant (you can always iterate
>> >> >> over a random access API if you know the size)?
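
The parenthetical above (iterating over a random access API when the size is known) can be sketched as follows. The RandomAccessVectors interface and its methods are assumptions made for this illustration and are not Lucene's actual types.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical interfaces for illustration; these are not Lucene's actual types. */
final class RandomAccessIteration {

  /** Assumed random-access view: vectors addressed by ordinal in [0, size()). */
  interface RandomAccessVectors {
    int size();

    float[] vectorValue(int ord);
  }

  /** Adapts a random-access view into a forward-only iterator over its vectors. */
  static Iterator<float[]> iterate(RandomAccessVectors vectors) {
    return new Iterator<float[]>() {
      private int ord = 0; // next ordinal to return

      @Override
      public boolean hasNext() {
        return ord < vectors.size();
      }

      @Override
      public float[] next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return vectors.vectorValue(ord++);
      }
    };
  }
}
```

The adaptation only goes cheaply in this direction: a forward-only iterator cannot serve arbitrary ordinals without rescanning or buffering.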
>> >> >>
>> >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msoko...@gmail.com> wrote:
>> >> >> >
>> >> >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know for
>> >> >> > sure unless someone revives
>> >> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something like
>> >> >> > that
>> >> >> >
>> >> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msoko...@gmail.com> wrote:
>> >> >> > >
>> >> >> > > Consistent plural naming makes sense to me. I think it ended up
>> >> >> > > singular because I am biased to avoid plural names unless there is a
>> >> >> > > useful distinction to be made. But consistency should trump my
>> >> >> > > predilections.
>> >> >> > >
>> >> >> > > I think the reason we have search() on VectorValues is that we have
>> >> >> > > LeafReader.getVectorValues() (by analogy to the DocValues iterators),
>> >> >> > > but no way to access the VectorReader. Do you think we should also
>> >> >> > > have LeafReader.getVectorReader()? Today it's only on CodecReader.
>> >> >> > >
>> >> >> > > Re: SearchStrategy.NONE; the idea is we support efficient access to
>> >> >> > > floating point values. Using BinaryDocValues for this will always
>> >> >> > > require an additional decoding step. I can see that the naming is
>> >> >> > > confusing there. The intent is that you index the vector values, but
>> >> >> > > no additional indexing data structure. Also: the reason HNSW is
>> >> >> > > mentioned in these SearchStrategy enums is to make room for other
>> >> >> > > vector indexing approaches, like LSH. There was a lot of discussion
>> >> >> > > that we wanted an API that allowed for experimenting with other
>> >> >> > > techniques for indexing and searching vector values.
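
To make the "additional decoding step" above concrete, here is a rough sketch of what storing a float vector in an opaque binary value implies for every read. BinaryVectorCodec and the little-endian layout are assumptions for illustration only; the sketch does not show how Lucene encodes either doc values or vectors.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical codec for illustration; the byte layout is an assumption. */
final class BinaryVectorCodec {

  /** Packs a float vector into bytes, 4 bytes per dimension, little-endian. */
  static byte[] encode(float[] vector) {
    ByteBuffer buf =
        ByteBuffer.allocate(vector.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    buf.asFloatBuffer().put(vector); // the float view shares the backing array
    return buf.array();
  }

  /** The extra step a reader pays on each access: bytes back into floats. */
  static float[] decode(byte[] bytes) {
    float[] out = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(out);
    return out;
  }
}
```

A format that knows it holds float vectors can hand back float[] values directly, which is the efficiency argument made above for keeping SearchStrategy.NONE.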
>> >> >> > >
>> >> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
>> >> >> > > but I think the situation is more akin to Points, where we have the
>> >> >> > > options on IndexableField. The metadata we store there (dimension and
>> >> >> > > score function) doesn't really result in different formats, i.e. code
>> >> >> > > paths for indexing and storage; they are more like parameters to the
>> >> >> > > format, in my mind. Perhaps the situation will look different when we
>> >> >> > > get our second vector indexing strategy (like LSH).
>> >> >> > >
>> >> >> > >
>> >> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
>> >> >> > > <tomoko.uchida.1...@gmail.com> wrote:
>> >> >> > > >
>> >> >> > > > > Should we rename VectorFormat to VectorsFormat? This would
>> >> >> > > > > be more consistent with other file formats that use the plural, like
>> >> >> > > > > PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
>> >> >> > > >
>> >> >> > > > +1 for using plural form for consistency - if we reconsider
>> >> >> > > > the names, how about VectorValuesFormat so that it follows the naming
>> >> >> > > > convention for XXXValues?
>> >> >> > > >
>> >> >> > > > DocValuesFormat / DocValues
>> >> >> > > > PointValuesFormat / PointValues
>> >> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat / VectorValues)
>> >> >> > > >
>> >> >> > > > > Should SearchStrategy constants avoid explicit references
>> >> >> > > > > to HNSW?
>> >> >> > > >
>> >> >> > > > Also +1 for decoupling HNSW specific implementations from
>> >> >> > > > general vectors, though I am not fully sure if we can strictly separate the
>> >> >> > > > similarity metrics and search algorithms for vectors.
>> >> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago,
>> >> >> > > > does it achieve its goal? I haven't followed the issue in months because of
>> >> >> > > > my laziness...
>> >> >> > > >
>> >> >> > > > Thanks,
>> >> >> > > > Tomoko
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > On Tue, Mar 16, 2021 at 19:32 Adrien Grand <jpou...@gmail.com> wrote:
>> >> >> > > >>
>> >> >> > > >> Hello,
>> >> >> > > >>
>> >> >> > > >> I've tried to catch up on the vector API and I have the
>> >> >> > > >> following questions. I've tried to read through discussions on JIRA first
>> >> >> > > >> in case it had been covered, but it's possible I missed some relevant ones.
>> >> >> > > >>
>> >> >> > > >> Should VectorValues#search be on VectorReader instead? It
>> >> >> > > >> felt a bit odd to me to have the search logic on the iterator.
>> >> >> > > >>
>> >> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that
>> >> >> > > >> it allows storing vectors but that NN search won't be supported. This looks
>> >> >> > > >> like a use-case for binary doc values to me? It also slightly caught me by
>> >> >> > > >> surprise due to the inconsistency with IndexOptions.NONE, which means "do
>> >> >> > > >> not index this field" (and likewise for DocValuesType.NONE), so I first
>> >> >> > > >> assumed that SearchStrategy.NONE also meant "do not index this field as a
>> >> >> > > >> vector".
>> >> >> > > >>
>> >> >> > > >> While postings and doc-value formats allow per-field
>> >> >> > > >> configuration via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors
>> >> >> > > >> use a different mechanism where VectorField#createHnswType sets attributes
>> >> >> > > >> on the field type that the vectors writer then reads. Should we have a
>> >> >> > > >> PerFieldVectorsFormat instead and configure these options via the vectors
>> >> >> > > >> format?
>> >> >> > > >>
>> >> >> > > >> Should SearchStrategy constants avoid explicit references to
>> >> >> > > >> HNSW? The rest of the API seems to try to be agnostic of the way that NN
>> >> >> > > >> search is implemented. Could we make SearchStrategy only about the
>> >> >> > > >> similarity metric that is used for vectors? This particular point seems
>> >> >> > > >> discussed on LUCENE-9322 but I couldn't find the conclusion.
>> >> >> > > >>
>> >> >> > > >> Should we rename VectorFormat to VectorsFormat? This would
>> >> >> > > >> be more consistent with other file formats that use the plural, like
>> >> >> > > >> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
>> >> >> > > >>
>> >> >> > > >> --
>> >> >> > > >> Adrien
>> >> >>
>> >> >>
>> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >> >>
>
> --
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info
>
