Apparently, there is already an implementation along the lines discussed here:
https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62 https://github.com/alexklibisz/elastik-nearest-neighbors/ -- J On Fri, Mar 1, 2019 at 4:23 PM Pedram Rezaei <pedr...@microsoft.com.invalid> wrote: > Hi there, > > > > Thank you for sharing your thoughts. I am finding them extremely useful > and to be honest exciting! > > > > Regarding the vector-based scoring, you are 100% correct. There are many > ways of having an efficient vector-based similarity scorer implemented on > top of an encoded vector stored at the document level in Lucene. > > > > As you have rightly pointed out, this in itself might not be sufficient > for large indexes. After all, the engine would need to read the vector per > document and then calculate similarity. > > > > LSH or similar n-pass (n>1) techniques are pretty interesting and > certainly can get us closer to using the existing index for lookup. As you > rightly point out below, it can come at a cost either to the performance or > the precision. > > > > I am personally very intrigued by the new generation of vector-based > indexes such as Facebook’s FAISS > <https://github.com/facebookresearch/faiss> library for similarity search > and clustering of dense vectors used as part of larger search offerings. Do > you think there might be a world in which Lucene would want to have > first-class support for vector-based searches? I think with such a > capability, we might open the door for new and innovative ways of > information retrieval. > > > > I am grateful to you all for your insights and this fascinating discussion! > > > > Pedram > > > > P.S. How do I join https://relevancy.slack.com > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Frelevancy.slack.com&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499757411&sdata=p41mJtw39mq5qG5oy3MZOEWHT%2BrfFqeANLhFcLOtIIo%3D&reserved=0> > ? > > > > *From:* René Kriegler <r...@rene-kriegler.de> > *Sent:* Friday, March 1, 2019 3:24 PM > *To:* Pedram Rezaei <pedr...@microsoft.com> > *Cc:* dev@lucene.apache.org; Radhakrishnan Srikanth (SRIKANTH) < > rsri...@microsoft.com>; Arun Sacheti <ar...@bing.com>; Kun Wu < > wu....@microsoft.com>; Junhua Wang <junhua.w...@microsoft.com>; Jason Li < > ja...@microsoft.com> > *Subject:* Re: Vector based store and ANN > > > > Hi there, > > > > Thank you for looping me in. Just a few random thoughts on this topic: > > > > - I’ve heard ;-) that this ES plugin is fast for vector-based scoring: > https://github.com/StaySense/fast-cosine-similarity > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FStaySense%2Ffast-cosine-similarity&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499747411&sdata=kHEdnvi3o9ZfSAiE%2FJQIhbI54Zf%2BLEwr%2F%2B40tpFDnv8%3D&reserved=0>. > The links in the ‘General’ section provide some history. As far as I can > see, there is nothing which couldn’t be implemented at Lucene level. > > > > - For retrieval, I think a two-pass approach looks like something worth > trying out. First pass: look up documents in a low dimensional space (maybe > produced via LSH) and then, in the second pass, calculate vector distances > in the high-dimensional space just for the documents that resulted from the > first pass. This solution will come with some compromises to make. For > example, a higher dimensionality of LSH would increase precision but also > produce more hash tokens and make the lookup slower, especially for large > indexes. > > > > - Day 2 of Haystack 2019 (https://haystackconf.com/agenda/ > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhaystackconf.com%2Fagenda%2F&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499757411&sdata=6cD7%2BHoxVsuozhLN27m7Jmowv3D4CUYtVHCipGRO8Ss%3D&reserved=0>) > will have a talk by Simon Hughes about ’Search with Vectors’. There is a > channel on this topic at OpenSource Connections’ search relevance Slack ( > https://relevancy.slack.com > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Frelevancy.slack.com&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499757411&sdata=p41mJtw39mq5qG5oy3MZOEWHT%2BrfFqeANLhFcLOtIIo%3D&reserved=0>) > and Simon has been one of the drivers of the discussion. > > > > Best, > > René > > > > > > On 1 Mar 2019, at 20:51, Pedram Rezaei <pedr...@microsoft.com> wrote: > > > > Thank you for sharing, and it is exciting to see how advanced your > thinking is. > > > > Yes, the idea is the same idea with an extra step that Rene also seems to > elude to here > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.slideshare.net%2FRenKriegler%2Fa-picture-is-worth-a-thousand-words-93680178&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C1%7C636870794499767411&sdata=BRJS4wkx7vRY8CX%2FiPvvltx41uy%2BwBAwtMEEoE1Gcag%3D&reserved=0> > in his comment. Instead of using these types of techniques only at the > scoring time, we can use them for information retrieval from the index. > This will allow us to, for example, index millions of images and quickly > and efficiently lookup the most relevant images. > > > > I would love to hear yours and others thoughts on this. I think there is a > great opportunity here, but it would need a lot of input and guidance from > the experts here. > > > > Thank you, > > > > Pedram > > > > *From:* David Smiley <david.w.smi...@gmail.com> > *Sent:* Friday, March 1, 2019 12:11 PM > *To:* dev@lucene.apache.org > *Cc:* Radhakrishnan Srikanth (SRIKANTH) <rsri...@microsoft.com>; Arun > Sacheti <ar...@bing.com>; Kun Wu <wu....@microsoft.com>; Junhua Wang < > junhua.w...@microsoft.com>; Jason Li <ja...@microsoft.com>; René Kriegler > <p...@rene-kriegler.com> > *Subject:* Re: Vector based store and ANN > > > > This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener > to me on this subject: https://haystackconf.com/2018/relevance-scoring/ > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhaystackconf.com%2F2018%2Frelevance-scoring%2F&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499777395&sdata=TbYqHGyZ4Cq6Zhx8FSr9ES90GVw%2BkHo7r5epAVYLlog%3D&reserved=0>. > Uses > random-projection forests which is a very clever technique. (CC'ing Rene) > > > > ~ David > > On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei < > pedr...@microsoft.com.invalid> wrote: > > Hi there, > > > > Thank you for the responses. Yes, we have a few scenarios in mind that can > benefit from a vector-based index optimized for ANN searches: > > > > - Advanced, optimized, and high precision visual search: For this to > work, we would convert the images to their vector representations and then > use algorithms and implementations such as SPTAG > > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FMicrosoft%2FSPTAG&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499777395&sdata=qRd%2B5ieCH2duJVxBxHbj4rVy03cHhbW2QxFGLJ6F%2BNs%3D&reserved=0> > , FAISS > > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffacebookresearch%2Ffaiss&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499787389&sdata=%2BWivx1i5cTAypkWJUaWXLq32ShZ9ncPEIuUzcV5lqtk%3D&reserved=0>, > and HNSWLIB > > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnmslib%2Fhnswlib&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499787389&sdata=2ZxNZFReYuryCjGak9Szz5BmgjT9G59IBOw9q3RlCbo%3D&reserved=0> > . > - Advanced document retrieval: Using a numerical vector representation > of a document, we could improve the search result > - Nearest neighbor queries: discovering the nearest neighbors to a > given query could also benefit from these ANN algorithms (although doesn’t > necessarily need the vector based index) > > > > I would be grateful to hear your thoughts and whether the community is > open to a conversation on this topic with my team. > > > > Thanks, > > > > Pedram > > > > *From:* J. Delgado <joaquin.delg...@gmail.com> > *Sent:* Thursday, February 28, 2019 7:38 AM > *To:* dev@lucene.apache.org > *Cc:* Radhakrishnan Srikanth (SRIKANTH) <rsri...@microsoft.com> > *Subject:* Re: Vector based store and ANN > > > > Lucene’s scoring function (which I believe is okapi BM25 > > https://en.m.wikipedia.org/wiki/Okapi_BM25 > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.m.wikipedia.org%2Fwiki%2FOkapi_BM25&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499797379&sdata=E0%2BLqnkwPxvJlL2ENYKgv0HDQxyPkB6iRw467PMBmRY%3D&reserved=0>) > is a kind of nearest neighbor using the TF-IDF vector representation of > documents and query. Are you interested in ANN to be applied to a different > kind of vector representation, say for example Doc2Vec? > > > > On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand <jpou...@gmail.com> wrote: > > Hi Pedram, > > We don't have much in this area, but I'm hearing increasing interest > so it'd be nice to get better there! The closest that we have is this > class that can search for nearest neighbors for a vector of up to 8 > dimensions: > https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Flucene-solr%2Fblob%2Fmaster%2Flucene%2Fsandbox%2Fsrc%2Fjava%2Forg%2Fapache%2Flucene%2Fdocument%2FFloatPointNearestNeighbor.java&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499807382&sdata=GvDfvwmayyPuk%2FyzdRwV6iz4dvEZNyZ%2FFjl%2BjKYKCAM%3D&reserved=0> > . > > On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei > <pedr...@microsoft.com.invalid> wrote: > > > > Hi there, > > > > > > > > Is there a way to store numerical vectors (vector based index) and > perform search based on Approximate Nearest Neighbor class of algorithms in > Lucene? > > > > > > > > If not, has there been any interests in the topic so far? > > > > > > > > Thanks, > > > > > > > > Pedram > > > > -- > Adrien > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > -- > > Lucene/Solr Search Committer (PMC), Developer, Author, Speaker > > LinkedIn: http://linkedin.com/in/davidwsmiley > <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Flinkedin.com%2Fin%2Fdavidwsmiley&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499807382&sdata=f4y0dYTDXxe7HMCZMbk9d5S%2BX8q93Yo7CkROITsyeNo%3D&reserved=0> > | Book: http://www.solrenterprisesearchserver.com > <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.solrenterprisesearchserver.com&data=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499817365&sdata=9pkGzZID%2FeuGEdd90ZOrpRUybWLVV2H7vHUO4kp9%2FA4%3D&reserved=0> > > >