RE: Vector based store and ANN

Pedram Rezaei Fri, 01 Mar 2019 16:23:35 -0800

Hi there,

Thank you for sharing your thoughts. I am finding them extremely useful and to 
be honest exciting!

Regarding the vector-based scoring, you are 100% correct. There are many ways 
of having an efficient vector-based similarity scorer implemented on top of an 
encoded vector stored at the document level in Lucene.

As you have rightly pointed out, this in itself might not be sufficient for 
large indexes. After all, the engine would need to read the vector per document 
and then calculate similarity.

LSH or similar n-pass (n>1) techniques are pretty interesting and certainly can 
get us closer to using the existing index for lookup. As you rightly point out 
below, it can come at a cost either to the performance or the precision.

I am personally very intrigued by the new generation of vector-based indexes 
such as Facebook’s FAISS<https://github.com/facebookresearch/faiss> library for 
similarity search and clustering of dense vectors used as part of larger search 
offerings. Do you think there might be a world in which Lucene would want to 
have first-class support for vector-based searches? I think with such a 
capability, we might open the door for new and innovative ways of information 
retrieval.

I am grateful to you all for your insights and this fascinating discussion!

Pedram

P.S. How do I join 
https://relevancy.slack.com<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Frelevancy.slack.com&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499757411&sdata=p41mJtw39mq5qG5oy3MZOEWHT%2BrfFqeANLhFcLOtIIo%3D&reserved=0>?

From: René Kriegler <r...@rene-kriegler.de>
Sent: Friday, March 1, 2019 3:24 PM
To: Pedram Rezaei <pedr...@microsoft.com>
Cc: dev@lucene.apache.org; Radhakrishnan Srikanth (SRIKANTH) 
<rsri...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu 
<wu....@microsoft.com>; Junhua Wang <junhua.w...@microsoft.com>; Jason Li 
<ja...@microsoft.com>
Subject: Re: Vector based store and ANN

Hi there,

Thank you for looping me in. Just a few random thoughts on this topic:

- I’ve heard ;-) that this ES plugin is fast for vector-based scoring: 
https://github.com/StaySense/fast-cosine-similarity<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FStaySense%2Ffast-cosine-similarity&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499747411&sdata=kHEdnvi3o9ZfSAiE%2FJQIhbI54Zf%2BLEwr%2F%2B40tpFDnv8%3D&reserved=0>.
 The links in the ‘General’ section provide some history. As far as I can see, 
there is nothing which couldn’t be implemented at Lucene level.

- For retrieval, I think a two-pass approach looks like something worth trying 
out. First pass: look up documents in a low dimensional space (maybe produced 
via LSH) and then, in the second pass, calculate vector distances in the 
high-dimensional space just for the documents that resulted from the first 
pass. This solution will come with some compromises to make. For example, a 
higher dimensionality of LSH would increase precision but also produce more 
hash tokens and make the lookup slower, especially for large indexes.

- Day 2 of Haystack 2019 
(https://haystackconf.com/agenda/<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhaystackconf.com%2Fagenda%2F&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499757411&sdata=6cD7%2BHoxVsuozhLN27m7Jmowv3D4CUYtVHCipGRO8Ss%3D&reserved=0>)
 will have a talk by Simon Hughes about ’Search with Vectors’. There is a 
channel on this topic at OpenSource Connections’ search relevance Slack 
(https://relevancy.slack.com<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Frelevancy.slack.com&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499757411&sdata=p41mJtw39mq5qG5oy3MZOEWHT%2BrfFqeANLhFcLOtIIo%3D&reserved=0>)
 and Simon has been one of the drivers of the discussion.

Best,
René

On 1 Mar 2019, at 20:51, Pedram Rezaei 
<pedr...@microsoft.com<mailto:pedr...@microsoft.com>> wrote:

Thank you for sharing, and it is exciting to see how advanced your thinking is.

Yes, the idea is the same idea with an extra step that Rene also seems to elude 
to 
here<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.slideshare.net%2FRenKriegler%2Fa-picture-is-worth-a-thousand-words-93680178&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499767411&sdata=BRJS4wkx7vRY8CX%2FiPvvltx41uy%2BwBAwtMEEoE1Gcag%3D&reserved=0>
 in his comment. Instead of using these types of techniques only at the scoring 
time, we can use them for information retrieval from the index. This will allow 
us to, for example, index millions of images and quickly and efficiently lookup 
the most relevant images.

I would love to hear yours and others thoughts on this. I think there is a 
great opportunity here, but it would need a lot of input and guidance from the 
experts here.

Thank you,

Pedram

From: David Smiley <david.w.smi...@gmail.com<mailto:david.w.smi...@gmail.com>>
Sent: Friday, March 1, 2019 12:11 PM
To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
Cc: Radhakrishnan Srikanth (SRIKANTH) 
<rsri...@microsoft.com<mailto:rsri...@microsoft.com>>; Arun Sacheti 
<ar...@bing.com<mailto:ar...@bing.com>>; Kun Wu 
<wu....@microsoft.com<mailto:wu....@microsoft.com>>; Junhua Wang 
<junhua.w...@microsoft.com<mailto:junhua.w...@microsoft.com>>; Jason Li 
<ja...@microsoft.com<mailto:ja...@microsoft.com>>; René Kriegler 
<p...@rene-kriegler.com<mailto:p...@rene-kriegler.com>>
Subject: Re: Vector based store and ANN

This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener to me 
on this subject: 
https://haystackconf.com/2018/relevance-scoring/<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhaystackconf.com%2F2018%2Frelevance-scoring%2F&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499777395&sdata=TbYqHGyZ4Cq6Zhx8FSr9ES90GVw%2BkHo7r5epAVYLlog%3D&reserved=0>.
 Uses random-projection forests which is a very clever technique.  (CC'ing Rene)

~ David
On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei 
<pedr...@microsoft.com.invalid<mailto:pedr...@microsoft.com.invalid>> wrote:
Hi there,

Thank you for the responses. Yes, we have a few scenarios in mind that can 
benefit from a vector-based index optimized for ANN searches:

  *   Advanced, optimized, and high precision visual search: For this to work, 
we would convert the images to their vector representations and then use 
algorithms and implementations such as 
SPTAG<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FMicrosoft%2FSPTAG&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499777395&sdata=qRd%2B5ieCH2duJVxBxHbj4rVy03cHhbW2QxFGLJ6F%2BNs%3D&reserved=0>,

FAISS<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffacebookresearch%2Ffaiss&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499787389&sdata=%2BWivx1i5cTAypkWJUaWXLq32ShZ9ncPEIuUzcV5lqtk%3D&reserved=0>,
 and 
HNSWLIB<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnmslib%2Fhnswlib&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499787389&sdata=2ZxNZFReYuryCjGak9Szz5BmgjT9G59IBOw9q3RlCbo%3D&reserved=0>.
  *   Advanced document retrieval: Using a numerical vector representation of a 
document, we could improve the search result
  *   Nearest neighbor queries: discovering the nearest neighbors to a given 
query could also benefit from these ANN algorithms (although doesn’t 
necessarily need the vector based index)

I would be grateful to hear your thoughts and whether the community is open to 
a conversation on this topic with my team.

Thanks,

Pedram

From: J. Delgado <joaquin.delg...@gmail.com<mailto:joaquin.delg...@gmail.com>>
Sent: Thursday, February 28, 2019 7:38 AM
To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
Cc: Radhakrishnan Srikanth (SRIKANTH) 
<rsri...@microsoft.com<mailto:rsri...@microsoft.com>>
Subject: Re: Vector based store and ANN

Lucene’s scoring function (which I believe is okapi BM25
https://en.m.wikipedia.org/wiki/Okapi_BM25<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.m.wikipedia.org%2Fwiki%2FOkapi_BM25&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499797379&sdata=E0%2BLqnkwPxvJlL2ENYKgv0HDQxyPkB6iRw467PMBmRY%3D&reserved=0>)
 is a kind of nearest neighbor using the TF-IDF vector representation of 
documents and query. Are you interested in ANN to be applied to a different 
kind of vector representation, say for example Doc2Vec?

On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand 
<jpou...@gmail.com<mailto:jpou...@gmail.com>> wrote:
Hi Pedram,

We don't have much in this area, but I'm hearing increasing interest
so it'd be nice to get better there! The closest that we have is this
class that can search for nearest neighbors for a vector of up to 8
dimensions: 
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Flucene-solr%2Fblob%2Fmaster%2Flucene%2Fsandbox%2Fsrc%2Fjava%2Forg%2Fapache%2Flucene%2Fdocument%2FFloatPointNearestNeighbor.java&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499807382&sdata=GvDfvwmayyPuk%2FyzdRwV6iz4dvEZNyZ%2FFjl%2BjKYKCAM%3D&reserved=0>.

On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
<pedr...@microsoft.com.invalid<mailto:pedr...@microsoft.com.invalid>> wrote:
>
> Hi there,
>
>
>
> Is there a way to store numerical vectors (vector based index) and perform 
> search based on Approximate Nearest Neighbor class of algorithms in Lucene?
>
>
>
> If not, has there been any interests in the topic so far?
>
>
>
> Thanks,
>
>
>
> Pedram

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: 
dev-unsubscr...@lucene.apache.org<mailto:dev-unsubscr...@lucene.apache.org>
For additional commands, e-mail: 
dev-h...@lucene.apache.org<mailto:dev-h...@lucene.apache.org>
--
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: 
http://linkedin.com/in/davidwsmiley<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Flinkedin.com%2Fin%2Fdavidwsmiley&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499807382&sdata=f4y0dYTDXxe7HMCZMbk9d5S%2BX8q93Yo7CkROITsyeNo%3D&reserved=0>
 | Book: 
http://www.solrenterprisesearchserver.com<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.solrenterprisesearchserver.com&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499817365&sdata=9pkGzZID%2FeuGEdd90ZOrpRUybWLVV2H7vHUO4kp9%2FA4%3D&reserved=0>

RE: Vector based store and ANN

Reply via email to