Re: Questions about the new vector API

Dmitry Kan Sat, 27 Mar 2021 03:01:15 -0700

Michael,

I got some interest in this area and have been doing comparative study of
different KNN implementations and blogging about it.


Did you use nmslib for HNSW implementation or something else?

On Tue, 16 Mar 2021 at 22:47, Michael Sokolov <msoko...@gmail.com> wrote:

> Yeah, HNSW is problematic in a few ways: (1) merging is costly due to
> the need to completely recreate the graph. (2) searching across a
> segmented index sacrifices much of the performance benefit of HNSW
> since the cost of searching HNSW graphs scales ~logarithmically with
> the size of the graph, so splitting into multiple graphs and then
> merge sorting results is pretty expensive. I guess the random access /
> scan forward dynamic is another problematic area.
>
> On Tue, Mar 16, 2021 at 1:28 PM Robert Muir <rcm...@gmail.com> wrote:
> >
> > Maybe that is so, but we should factor in everything: such as large
> scale indexing, not requiring whole data set to be in RAM, etc. Hey, it's
> Lucene!
> >
> > Because HNSW has dominated the nightly benchmarks, I have been digging
> through stacktraces and trying to figure out ways to make it work
> efficiently, and I'm not sure what to do.
> > Especially merge is painful: it seems to cause a storm of page
> faults/random accesses due to how it works, and I don't know yet how to
> make it better.
> > It seems to rebuild the entire graph, spraying random accesses across a
> "slow-wrapper" that binary searches each sub on every access.
> > I don't see any way to even amortize the pain with some kind of bulk
> merge trick.
> >
> > So if we find algorithms that scale better, I think we should lend a
> preference towards them. For example, algorithms that allow
> per-segment/sequential index and merge.
> >
> > On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
> >>
> >> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
> >> (approximate NN) algorithms. When we started this effort, HNSW was at
> >> the top of the heap in most of the benchmarks.
> >>
> >> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir <rcm...@gmail.com> wrote:
> >> >
> >> > Where are the alternative algorithms that work on sequential
> iterators and don't need random access?
> >> >
> >> > Seems like these should be the ones we initially add to lucene, and
> HNSW should be put aside for now? (is it a toy, or can we do it without
> jazillions of random accesses?)
> >> >
> >> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
> >> >>
> >> >> There's also some good discussion on
> >> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random
> access
> >> >> vs iterator pattern that never got fully resolved. We said we would
> >> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of
> >> >> random access is pretty well-established there, maybe we should
> >> >> abandon the iterator API since it is redundant (you can always
> iterate
> >> >> over a random access API if you know the size)?
> >> >>
> >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
> >> >> >
> >> >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know
> for
> >> >> > sure unless someone revives
> >> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something
> like
> >> >> > that
> >> >> >
> >> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <
> msoko...@gmail.com> wrote:
> >> >> > >
> >> >> > > Consistent plural naming makes sense to me. I think it ended up
> >> >> > > singular because I am biased to avoid plural names unless there
> is a
> >> >> > > useful distinction to be made. But consistency should trump my
> >> >> > > predilections.
> >> >> > >
> >> >> > > I think the reason we have search() on VectorValues is that we
> have
> >> >> > > LeafReader.getVectorValues() (by analogy to the DocValues
> iterators),
> >> >> > > but no way to access the VectorReader. Do you think we should
> also
> >> >> > > have LeafReader.getVectorReader()? Today it's only on
> CodecReader.
> >> >> > >
> >> >> > > Re: SearchStrategy.NONE; the idea is we support efficient access
> to
> >> >> > > floating point values. Using BinaryDocValues for this will always
> >> >> > > require an additional decoding step. I can see that the naming is
> >> >> > > confusing there. The intent is that you index the vector values,
> but
> >> >> > > no additional indexing data structure. Also: the reason HNSW is
> >> >> > > mentioned in these SearchStrategy enums is to make room for other
> >> >> > > vector indexing approaches, like LSH. There was a lot of
> discussion
> >> >> > > that we wanted an API that allowed for experimenting with other
> >> >> > > techniques for indexing and searching vector values.
> >> >> > >
> >> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and
> DocValues),
> >> >> > > but I think the situation is more akin to Points, where we have
> the
> >> >> > > options on IndexableField. The metadata we store there
> (dimension and
> >> >> > > score function) don't really result in different formats, ie code
> >> >> > > paths for indexing and storage; they are more like parameters to
> the
> >> >> > > format, in my mind. Perhaps the situation will look different
> when we
> >> >> > > get our second vector indexing strategy (like LSH).
> >> >> > >
> >> >> > >
> >> >> > > On Tue, Mar 16, 2021 at 10:19 AM Tomoko Uchida
> >> >> > > <tomoko.uchida.1...@gmail.com> wrote:
> >> >> > > >
> >> >> > > > > Should we rename VectorFormat to VectorsFormat? This would
> be more consistent with other file formats that use the plural, like
> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> >> >> > > >
> >> >> > > > +1 for using plural form for consistency - if we reconsider
> the names, how about VectorValuesFormat so that it follows the naming
> convention for XXXValues?
> >> >> > > >
> >> >> > > > DocValuesFormat / DocValues
> >> >> > > > PointValuesFormat / PointValues
> >> >> > > > VectorValuesFormat / VectorValues (currently, VectorFormat /
> VectorValues)
> >> >> > > >
> >> >> > > > > Should SearchStrategy constants avoid explicit references to
> HNSW?
> >> >> > > >
> >> >> > > > Also +1 for decoupling HNSW specific implementations from
> general vectors, though I am not fully sure if we can strictly separate the
> similarity metrics and search algorithms for vectors.
> >> >> > > > LUCENE-9322 (unified vectors API) was resolved months ago,
> does it achieve its goal? I haven't followed the issue in months because of
> my laziness...
> >> >> > > >
> >> >> > > > Thanks,
> >> >> > > > Tomoko
> >> >> > > >
> >> >> > > >
> >> >> > > > 2021年3月16日(火) 19:32 Adrien Grand <jpou...@gmail.com>:
> >> >> > > >>
> >> >> > > >> Hello,
> >> >> > > >>
> >> >> > > >> I've tried to catch up on the vector API and I have the
> following questions. I've tried to read through discussions on JIRA first
> in case it had been covered, but it's possible I missed some relevant ones.
> >> >> > > >>
> >> >> > > >> Should VectorValues#search be on VectorReader instead? It
> felt a bit odd to me to have the search logic on the iterator.
> >> >> > > >>
> >> >> > > >> Do we need SearchStrategy.NONE? Documentation suggests that
> it allows storing vectors but that NN search won't be supported. This looks
> like a use-case for binary doc values to me? It also slightly caught me by
> surprise due to the inconsistency with IndexOptions.NONE, which means "do
> not index this field" (and likewise for DocValuesType.NONE), so I first
> assumed that SearchStrategy.NONE also meant "do not index this field as a
> vector".
> >> >> > > >>
> >> >> > > >> While postings and doc-value formats allow per-field
> configuration via PerFieldPostingsFormat/PerFieldDocValuesFormat, vectors
> use a different mechanism where VectorField#createHnswType sets attributes
> on the field type that the vectors writer then reads. Should we have a
> PerFieldVectorsFormat instead and configure these options via the vectors
> format?
> >> >> > > >>
> >> >> > > >> Should SearchStrategy constants avoid explicit references to
> HNSW? The rest of the API seems to try to be agnostic of the way that NN
> search is implemented. Could we make SearchStrategy only about the
> similarity metric that is used for vectors? This particular point seems
> discussed on LUCENE-9322 but I couldn't find the conclusion.
> >> >> > > >>
> >> >> > > >> Should we rename VectorFormat to VectorsFormat? This would be
> more consistent with other file formats that use the plural, like
> PostingsFormat, DocValuesFormat, TermVectorsFormat, etc.?
> >> >> > > >>
> >> >> > > >> --
> >> >> > > >> Adrien
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-- 
-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info

Re: Questions about the new vector API

Reply via email to