[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257732#comment-15257732
 ] 

Cao Manh Dat commented on LUCENE-6968:
--------------------------------------

Thanks for the link. I totally agree that keeping the lowest few values of a 
single hash function would be better. 

But the wiki doc points out that the estimator formula for the "variant with a 
single hash function" is not the same as the estimator formula for the "variant 
with many hash functions". So the generated query must be different in each 
case.

For example, if we use a single hash function and keep the k = 5 lowest values:
1. We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
2. The k lowest values of the union are hk(A U B) = [1, 2, 3, 4, 5]
3. So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k = |{5}| / k = 0.2
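
The arithmetic in this example can be checked with a minimal Python sketch. It 
is only an illustration of the single-hash (bottom-k) estimator, not code from 
the attached patch. The names bottom_k and estimate_jaccard are hypothetical, 
and the lists are assumed to already contain hash values.

```python
import heapq


def bottom_k(values, k):
    """Return the k smallest distinct hash values as a set."""
    return set(heapq.nsmallest(k, set(values)))


def estimate_jaccard(hashes_a, hashes_b, k):
    """Bottom-k estimate: |hk(A U B) & hk(A) & hk(B)| / k."""
    sketch_a = bottom_k(hashes_a, k)
    sketch_b = bottom_k(hashes_b, k)
    # The k lowest values of the union can be taken from the two sketches.
    sketch_union = bottom_k(sketch_a | sketch_b, k)
    shared = sketch_union & sketch_a & sketch_b
    return len(shared) / k


# Hash values from the example above (assumed to be already hashed).
doc_a = [1, 2, 5, 6, 7]
doc_b = [3, 4, 5, 6, 7]

estimate = estimate_jaccard(doc_a, doc_b, k=5)
print(estimate)  # 0.2
```

Only the value 5 is in both sketches and also among the 5 lowest values of the 
union, so the estimate is 1 / 5.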

> LSH Filter
> ----------
>
>                 Key: LUCENE-6968
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6968
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>         Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a 
> given document. The similarity measure can be cosine, Jaccard, Euclidean...
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We want to find documents that have a Jaccard score of 0.6 or higher with 
> this doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis would also return doc 4).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

