[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262169#comment-15262169 ]

Andy Hind commented on LUCENE-6968:
-----------------------------------

I agree a pure token stream test makes sense. The only concern I have is about 
testing token filters chained together. Chaining shingle generation with min 
hashing requires that the underlying token stream has its state reset correctly 
for reuse. As I missed this, I added a test to cover it. Is there somewhere 
else in the test framework that covers this case? Some randomised chaining of 
filters? Perhaps chaining is more of a Solr thing. A sketch of the kind of 
chained reuse I mean is below.
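To illustrate: the same Analyzer instance is run over two documents, so every 
filter in the chain has to clear its per-document state in reset(). This is only 
a minimal sketch of the scenario, not the patch's actual test; the analyzer 
wiring and the commented-out min hash construction are assumptions.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ChainedReuseSketch {

  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        // 5-word shingles; in the patch these would feed the min hash filter,
        // e.g. sink = new MinHashFilter(sink, ...) -- constructor args assumed here.
        TokenStream sink = new ShingleFilter(source, 5, 5);
        return new TokenStreamComponents(source, sink);
      }
    };

    // Reusing the same analyzer for a second document exercises reset()
    // on every filter in the chain.
    for (String doc : new String[] {"the quick brown fox jumps over the lazy dog",
                                    "the quick brown fox jumps over the lazy cat"}) {
      try (TokenStream ts = analyzer.tokenStream("text", doc)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term.toString());
        }
        ts.end();
      }
    }
  }
}
{code}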

I would prefer to stick with a 128/96 bit hash. The link below [1] "suggests" 
that 5-shingles become well distributed. Link [2] says up to 2/3 of all possible 
trigrams have been seen in 30 years of news articles. So it seems we can 
expect to see many of the possible 5-shingles. Some bioinformatics use cases may 
also require this.

{quote}
[1] 
http://googleresearch.blogspot.co.uk/2006/08/all-our-n-gram-are-belong-to-you.html
[2] http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
{quote}

I was not that keen to add Guava! However, it was already present somewhere in 
the project. I am happy if this moves off into a separate module. I will also 
look at how this dependency could be removed.

Perhaps we should take some time to consider how to include the fingerprint 
length (the sum of the min set sizes over all hashes) to support an unbiased 
query. An unbiased query would be more difficult to build correctly. Some 
fingerprint/LSH query support and tests may make sense. Some other statistics 
may also be useful in generating faster queries that find similar documents 
given a threshold and a probability of meeting that threshold. A rough sketch 
of the query side follows.
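As a starting point for discussion, the simple (biased) form of such a query 
just ORs together the query document's min hash terms and requires a fraction 
of them to match. The field name, term list and the mapping from threshold to a 
minimum-should-match count below are illustrative assumptions, not part of the 
patch.

{code:java}
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MinHashQuerySketch {

  /** OR together a document's min hash terms, requiring a fraction of them to match. */
  static Query similarDocumentsQuery(String field, List<String> minHashTerms, float threshold) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (String hash : minHashTerms) {
      builder.add(new TermQuery(new Term(field, hash)), BooleanClause.Occur.SHOULD);
    }
    // Requiring roughly `threshold` of the fingerprint to match approximates the
    // desired similarity, but the mapping is biased unless the fingerprint length
    // (min set size summed over all hashes) is taken into account as discussed above.
    builder.setMinimumNumberShouldMatch((int) Math.ceil(threshold * minHashTerms.size()));
    return builder.build();
  }
}
{code}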

> LSH Filter
> ----------
>
>                 Key: LUCENE-6968
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6968
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>            Assignee: Tommaso Teofili
>         Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a 0.8 or higher similarity score with a given 
> document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We want to find documents that have a 0.6 or higher Jaccard score with this 
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis would also return doc 4).


