[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259924#comment-15259924
 ] 

Andy Hind commented on LUCENE-6968:
-----------------------------------

This comes down to "what is a good estimate of |A U B|" and do we need it for 
the use case.

For query, the main use case is finding documents like one source document. So 
we are comparing Doc A with all other documents. What we need is a measure that 
is fair for A -> B, A -> C. We probably do not care about B -> C. If we take 
the fingerprint from A and just OR the bits together into a big query we have a 
consistent measure of similarity of A with any other document. This particular 
measure is biased. For a start Sim(A, B) is not equal to Sim(B, A). But for 
this use case that may not matter. This measure contains both inclusion and 
duplication which may be a good thing. It is also pretty intuitive what it 
means. This is (A ∩ B)/|A|.

If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of 
|A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - 
|A ∩ B|), or some similar estimate. We could use the size of the fingerprint 
set. We could keep the full ordered set of hashes and have extra statistics 
like the total number of hashes and total number of unique hashes. 

For two short documents, where there are fewer fingerprints than the maximim, 
we have the full set.
For two larger docs we have an estimate of these based in the min hash sets.  
You can argue "min of many hashes"  is a random sample with replacement and 
"min set of one hash" is a random sample with out replacement (min set); if 
your hash is good. If the sample is small compared with the set of all hashes 
the arguments converge. 

If we were doing arbitrary comparisons between any pair of documents then we 
would have to use an unbiased estimator. Finding candidate pairs, moving onto 
clustering, ...


> LSH Filter
> ----------
>
>                 Key: LUCENE-6968
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6968
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>         Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to