[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259924#comment-15259924 ]
Andy Hind commented on LUCENE-6968: ----------------------------------- This comes down to "what is a good estimate of |A U B|" and do we need it for the use case. For query, the main use case is finding documents like one source document. So we are comparing Doc A with all other documents. What we need is a measure that is fair for A -> B, A -> C. We probably do not care about B -> C. If we take the fingerprint from A and just OR the bits together into a big query we have a consistent measure of similarity of A with any other document. This particular measure is biased. For a start Sim(A, B) is not equal to Sim(B, A). But for this use case that may not matter. This measure contains both inclusion and duplication which may be a good thing. It is also pretty intuitive what it means. This is (A ∩ B)/|A|. If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of |A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - |A ∩ B|), or some similar estimate. We could use the size of the fingerprint set. We could keep the full ordered set of hashes and have extra statistics like the total number of hashes and total number of unique hashes. For two short documents, where there are fewer fingerprints than the maximim, we have the full set. For two larger docs we have an estimate of these based in the min hash sets. You can argue "min of many hashes" is a random sample with replacement and "min set of one hash" is a random sample with out replacement (min set); if your hash is good. If the sample is small compared with the set of all hashes the arguments converge. If we were doing arbitrary comparisons between any pair of documents then we would have to use an unbiased estimator. Finding candidate pairs, moving onto clustering, ... > LSH Filter > ---------- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org