[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328054#comment-15328054 ]

Andy Hind edited comment on LUCENE-6968 at 6/13/16 7:55 PM:

Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter that was failing intermittently, I think on one of the evil test cases. LongPair should implement equals (and probably hashCode if it will be reused) as it goes into a TreeSet. An oversight on my part. FWIW, as far as I can tell, the change in patch 6 was included in 5.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Cao Manh Dat
> Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a 0.8 or higher similarity score with a given document. The similarity measurement can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java
> {quote}
> We want to find documents that have a 0.6 Jaccard score with this doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
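The TreeSet point above is worth spelling out: TreeSet deduplicates via compareTo, but equals and hashCode must be consistent with that ordering for the class to behave correctly in other collections. A hypothetical sketch (the actual LongPair in the patch is not shown here):

```java
// Hypothetical LongPair holding a 128-bit hash as two longs; the real
// class in the patch may differ. Shows equals/hashCode consistent with
// the compareTo that TreeSet uses.
final class LongPair implements Comparable<LongPair> {
    final long hi;
    final long lo;

    LongPair(long hi, long lo) {
        this.hi = hi;
        this.lo = lo;
    }

    @Override
    public int compareTo(LongPair o) {
        // unsigned comparison, high word first
        int c = Long.compareUnsigned(hi, o.hi);
        return c != 0 ? c : Long.compareUnsigned(lo, o.lo);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof LongPair)) return false;
        LongPair p = (LongPair) o;
        return hi == p.hi && lo == p.lo;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(hi) * 31 + Long.hashCode(lo);
    }
}
```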
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263867#comment-15263867 ]

Andy Hind edited comment on LUCENE-6968 at 5/6/16 8:43 PM:

After a bit more digging, the single hash and keeping the minimum set can be improved. See:

[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keeping the minimum set, split the hash space up into 500 buckets (for a 500-hash fingerprint) and keep the minimum for each bucket. To fill an empty bucket, take the minimum from the next non-empty bucket on the right, with rotation.
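The bucketed scheme above can be sketched roughly as follows, assuming 32-bit hash values stored as unsigned longs (the bucket count and hash width are illustrative, not what the papers or patch mandate):

```java
import java.util.Arrays;

public class BucketedMinHash {
    static final long EMPTY = -1L;

    // Split the 32-bit hash range into `buckets` equal slices and keep
    // the minimum hash seen in each slice; then densify empty buckets
    // by borrowing the minimum of the next non-empty bucket to the
    // right, rotating past the end of the array.
    static long[] fingerprint(long[] hashes, int buckets) {
        long[] mins = new long[buckets];
        Arrays.fill(mins, EMPTY);
        for (long h : hashes) {
            // map h in [0, 2^32) proportionally onto a bucket
            int b = (int) ((h * buckets) >>> 32);
            if (mins[b] == EMPTY || h < mins[b]) {
                mins[b] = h;
            }
        }
        // densification pass reads the original mins so a borrowed
        // value always comes from a genuinely non-empty bucket
        long[] result = mins.clone();
        for (int i = 0; i < buckets; i++) {
            if (mins[i] != EMPTY) continue;
            for (int j = 1; j < buckets; j++) {
                long candidate = mins[(i + j) % buckets];
                if (candidate != EMPTY) {
                    result[i] = candidate;
                    break;
                }
            }
        }
        return result;
    }
}
```

This yields a fixed-length fingerprint from a single hash function, which is the point of the one-permutation approach in [1] and [2].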
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259924#comment-15259924 ]

Andy Hind edited comment on LUCENE-6968 at 4/28/16 9:27 AM:

This comes down to "what is a good estimate of |A U B|" and whether we need it for the use case.

For query, the main use case is finding documents like one source document. So we are comparing doc A with all other documents. What we need is a measure that is fair for A -> B, A -> C. We probably do not care about B -> C. If we take the fingerprint from A and just OR the bits together into a big query, we have a consistent measure of similarity of A with any other document. This particular measure is biased: for a start, Sim(A, B) is not equal to Sim(B, A). But for this use case that may not matter. This measure contains both inclusion and duplication, which may be a good thing, and it is pretty intuitive what it means. It is |A ∩ B| / |A|.

If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of |A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - |A ∩ B|), or some similar estimate. We could use the size of the fingerprint sets. We could keep the full ordered set of hashes and have extra statistics like the total number of hashes and the total number of unique hashes. For two short documents, where there are fewer fingerprints than the maximum, we have the full sets. For two larger docs we have an estimate of these based on the min hash sets. You can argue "min of many hashes" is a random sample with replacement and "min set of one hash" is a random sample without replacement, if your hash is good. If the sample is small compared with the set of all hashes, the arguments converge.

If we were doing arbitrary comparisons between any pair of documents then we would have to use an unbiased estimator: finding candidate pairs, moving on to clustering, ...
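The asymmetric measure discussed above, scoring a candidate B by the fraction of A's fingerprint hashes it contains, can be illustrated in a few lines (a toy sketch over hash sets, not the actual query implementation):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class BiasedSim {
    // OR the fingerprint hashes of A into one query and score candidate
    // B by the fraction of A's hashes it matches: |A ∩ B| / |A|.
    // Note this is not symmetric: sim(A, B) != sim(B, A) in general.
    static double sim(Set<Long> fpA, Set<Long> fpB) {
        long matched = fpA.stream().filter(fpB::contains).count();
        return (double) matched / fpA.size();
    }
}
```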
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257732#comment-15257732 ]

Cao Manh Dat edited comment on LUCENE-6968 at 4/26/16 8:29 AM:

Thanks for the link. I totally agree that keeping some lowest values for a single hash function would be better. But the wiki doc points out that the estimator formulation for the "variant with a single hash function" is not the same as the estimator formulation for the "variant with many hash functions". So the generated query must be different for each case. For example, in case we use a single hash function and keep some lowest values:

- We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
- So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k = |{5}| / k = 0.2
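The single-hash-function estimator quoted above can be worked through directly, treating the listed values as already-hashed elements (hk(S) keeps the k smallest values of S):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class SingleHashEstimator {
    // hk(S): the k smallest values of S
    static Set<Integer> kSmallest(Set<Integer> s, int k) {
        Set<Integer> out = new TreeSet<>();
        for (int v : new TreeSet<>(s)) { // iterates in ascending order
            if (out.size() == k) break;
            out.add(v);
        }
        return out;
    }

    // J(A,B) ≈ |hk(A ∪ B) ∩ hk(A) ∩ hk(B)| / k
    static double estimate(Set<Integer> a, Set<Integer> b, int k) {
        Set<Integer> union = new TreeSet<>(a);
        union.addAll(b);
        Set<Integer> x = kSmallest(union, k);
        x.retainAll(kSmallest(a, k));
        x.retainAll(kSmallest(b, k));
        return (double) x.size() / k;
    }
}
```

With k = 5 and the example sets, hk(A ∪ B) = {1, 2, 3, 4, 5}, and the triple intersection is {5}, giving 1/5 = 0.2 as in the comment.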
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256114#comment-15256114 ]

Andy Hind edited comment on LUCENE-6968 at 4/25/16 11:52 AM:

The argument here says it is pretty much the same: https://en.wikipedia.org/wiki/MinHash

The plan was to offer both options.

With respect to banding and finding docs related to some start document, the number of hashes may depend on the start document. Let's start with 5-word shingles, one hash, and keeping the minimum 100 hash values. For a five-word document we get one hash. For a 100-word doc where all the shingles/words are the same we get one hash. For all-different shingles we get 96 hashes. If instead we have 100 different hashes and keep the lowest value of each, all the above cases end up with 100 hashes.

So back to banding. With minimum sets, you need to look and see how many hashes you really got and then do the banding. Comparing a small document/snippet (where we get 10 hashes in the fingerprint) with a much larger document (where we get 100 hashes) is an interesting case to consider. Starting with the small document there are fewer bits to match in the generated query. With 100 hashes from the small document I think you end up in roughly the same place, except for small snippets: any given band is more likely to have the same shingle hashed different ways.

With a 100-hash fingerprint, sampling for 100 words is great but not so great for 100,000 words. With a minimum set we have the option to generate a fingerprint related to the document length and other features. There is also an argument for a winnowing approach.
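The shingle arithmetic behind those cases is simple to state: a w-word document yields at most w - shingleSize + 1 shingles (or one, for very short documents), and the min-set fingerprint keeps at most min(k, distinct shingles) hashes. A small sketch of that counting:

```java
public class ShingleCounts {
    // A w-word document yields max(1, w - shingleSize + 1) shingles,
    // so with 5-word shingles a 100-word all-distinct-word document
    // gives 96 shingles, and a 5-word document gives 1.
    static int maxShingles(int words, int shingleSize) {
        return Math.max(1, words - shingleSize + 1);
    }

    // With a "keep the k minimum hashes" fingerprint, the fingerprint
    // size is capped by the number of distinct shingles actually seen.
    static int minSetFingerprintSize(int distinctShingles, int k) {
        return Math.min(distinctShingles, k);
    }
}
```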
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255434#comment-15255434 ]

Cao Manh Dat edited comment on LUCENE-6968 at 4/25/16 4:52 AM:

What a wonderful patch. The code is optimized, and the index will surely be much smaller! But the patch keeps some lowest values for each position; does that affect the formula below?

{code}
Pr(h(s1) = h(s2)) = Jaccard(s1,s2)
{code}
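The formula above is easy to check empirically: over many independent hash functions, the fraction of functions on which min-hash(s1) equals min-hash(s2) approaches Jaccard(s1, s2). A small simulation (the mix() finalizer is an assumed stand-in for a well-behaved hash, not anything from the patch):

```java
public class MinHashAgreement {
    // 64-bit mixing finalizer used as a cheap keyed hash for the demo
    static long mix(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    // minimum of h_seed(x) over the set, where h_seed(x) = mix(x ^ mix(seed))
    static long minHash(long[] set, long seed) {
        long min = Long.MAX_VALUE;
        for (long x : set) {
            min = Math.min(min, mix(x ^ mix(seed)));
        }
        return min;
    }

    // fraction of random hash functions on which the min-hashes agree;
    // should approach Jaccard(a, b) as trials grows
    static double agreement(long[] a, long[] b, int trials) {
        int agree = 0;
        for (int seed = 0; seed < trials; seed++) {
            if (minHash(a, seed) == minHash(b, seed)) agree++;
        }
        return (double) agree / trials;
    }
}
```

For a = {1, 2, 5, 6, 7} and b = {3, 4, 5, 6, 7}, Jaccard is 3/7 ≈ 0.43, and the measured agreement should land close to that.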
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252743#comment-15252743 ]

Andy Hind edited comment on LUCENE-6968 at 4/21/16 11:06 PM:

Hi,

It would be quite common to use min hashing after shingling. At this point the number of possible word combinations vs the size of the hash is important. With shingles of 5 words from a 100,000-word vocabulary that is 10^25 combinations. Some naive processing of the ~500k Enron emails (splitting on white space, case folding, and 5-word shingles) gives ~52M combinations. So a long hash would be better, at ~1.8e19. I have not yet looked at a larger corpus.

The LSH query is neat. However the logic can give banding where the last band is uneven. In the patch I think the last band would be dropped unless bands * rows-per-band = number of hashes. The underlying state of the source filter may also be lost (if using shingling).

I do not believe the similarity is required at all. I think you can get Jaccard distance using constant score queries and disabling coordination on the boolean query.

I went for 128-bit hashes, or a 32-bit hash identifier + 96-bit hash, with a bit more flexibility allowing a minimum set of hash values for a bunch of hashes. There is clearly some trade-off between speed of hashing and over-representing short documents. The minimum set may be a solution to this. I think there is some interesting research there.

I will add my patch, inspired by the original, and apologise for the mixed formatting in advance.
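The uneven-band issue described above is just integer division: splitting n min-hashes into bands of r rows leaves a trailing partial band of n mod r hashes unless r divides n, and that partial band is what risks being silently dropped. A sketch of the band layout:

```java
public class Banding {
    // Returns the size of each band when numHashes min-hashes are split
    // into bands of rowsPerBand rows; the last entry is the uneven
    // trailing band when rowsPerBand does not divide numHashes.
    static int[] bandSizes(int numHashes, int rowsPerBand) {
        int full = numHashes / rowsPerBand;
        int remainder = numHashes % rowsPerBand;
        int[] sizes = new int[full + (remainder > 0 ? 1 : 0)];
        java.util.Arrays.fill(sizes, 0, full, rowsPerBand);
        if (remainder > 0) sizes[full] = remainder; // the partial band
        return sizes;
    }
}
```

For 100 hashes in bands of 7, this gives 14 full bands plus a trailing band of 2, which a banding loop assuming bands * rows = hashes would miss.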
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093091#comment-15093091 ] Cao Manh Dat edited comment on LUCENE-6968 at 1/12/16 1:36 AM: --- Yes, It kinda like finding K nearest neighbor. But there a different here: - In K nearest neighbor, we use K as parameter which decide how many nearest neighbor we wanna retrieve. - In LSH we use a radius as parameter. The radius define minimum distance between center doc and other docs. (LSH is also far faster than K nearest neighbor) In both case, the result will be rank by distance to the center doc. was (Author: caomanhdat): Yes, It kinda like finding K nearest neighbor. But there a different here: - In K nearest neighbor, we use K as parameter which decide how many nearest neighbor we wanna retrieve. - In LSH we use a radius as parameter. The radius define minimum distance between center doc and other docs. In both case, the result will be rank by distance to the center doc. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. 
Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We want to find documents that have a Jaccard score of at least 0.6 with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4)
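The K-vs-radius distinction in the comment above can be sketched as follows (toy data and names, not from the patch):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy contrast between K-nearest-neighbor and radius (LSH-style) retrieval.
class KnnVsRadius {
    // Distances from the "center" doc to docs 1..5.
    static final Map<String, Double> DISTANCES = Map.of(
        "doc1", 0.10, "doc2", 0.25, "doc3", 0.40, "doc4", 0.80, "doc5", 0.95);

    // k-NN: always returns exactly k docs, however far away they are.
    static List<String> topK(int k) {
        return DISTANCES.entrySet().stream()
            .sorted(Map.Entry.comparingByValue())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    // Radius query: returns every doc within the radius, however many that is.
    static List<String> withinRadius(double radius) {
        return DISTANCES.entrySet().stream()
            .filter(e -> e.getValue() <= radius)
            .sorted(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Both methods return results ranked by distance to the center doc, matching the comment's closing point; only the stopping criterion differs.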
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091661#comment-15091661 ] Cao Manh Dat edited comment on LUCENE-6968 at 1/11/16 9:25 AM: --- This new patch includes: - CosineHashFilter, which supports finding similar documents based on cosine similarity. - LSH#createSlowQuery + LSHSimilarity, which support finding similar documents under an arbitrary similarity measure and return similarity scores. {quote} doc1 : "Solr is an open source search engine based on Lucene" doc2 : "Solr is an open source enterprise search engine based on Lucene" doc3 : "Solr is a popular open source enterprise search engine based on Lucene" When we search for documents similar to doc1, the result will contain: doc1 : score = 100 doc2 : score = 90 doc3 : score = 87 {quote} was (Author: caomanhdat): This new patch includes: - CosineHashFilter, which supports finding similar documents based on cosine similarity. - LSH#createSlowQuery + LSHSimilarity, which support finding similar documents under an arbitrary similarity measure and return similarity scores. {quote} doc1 : "Solr is an open source search engine based on Lucene" doc2 : "Solr is an open source enterprise search engine based on Lucene" doc3 : "Solr is a popular open source enterprise search engine based on Lucene" When we search for documents similar to doc1, the result will contain: doc1 : score = 100 doc2 : score = 90 doc3 : score = 87 {quote} > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH, which supports queries like this: > {quote} > Find similar documents that have a 0.8 or higher similarity score with a given > document. The similarity measurement can be cosine, Jaccard, Euclidean, etc. > {quote} > For example, given the following corpus: > {quote} > 1. Solr is an open source search engine based on Lucene > 2. 
Solr is an open source enterprise search engine based on Lucene > 3. Solr is a popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We want to find documents that have a Jaccard score of at least 0.6 with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4)
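For context on hashing for cosine similarity: one standard LSH family is random-hyperplane (sign) hashing, where each random hyperplane contributes one bit depending on which side of it the vector falls. The sketch below illustrates that family; it is not the patch's actual CosineHashFilter code:

```java
// Illustrative random-hyperplane (sign) hashing for cosine similarity.
// Vectors on the same side of every hyperplane get the same hash, and the
// probability two vectors agree on a bit grows with their cosine similarity.
class CosineSketch {
    static int signHash(double[] vec, double[][] hyperplanes) {
        int bits = 0;
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0;
            for (int j = 0; j < vec.length; j++) {
                dot += vec[j] * hyperplanes[i][j];
            }
            if (dot >= 0) bits |= (1 << i);  // one sign bit per hyperplane
        }
        return bits;
    }
}
```

In practice the hyperplanes are drawn from a Gaussian distribution and the resulting bit signatures are indexed as terms, so candidate documents can be found by term lookup instead of a full scan.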
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091661#comment-15091661 ] Cao Manh Dat edited comment on LUCENE-6968 at 1/11/16 9:24 AM: --- This new patch includes: - CosineHashFilter, which supports finding similar documents based on cosine similarity. - LSH#createSlowQuery + LSHSimilarity, which support finding similar documents under an arbitrary similarity measure and return similarity scores. {quote} doc1 : "Solr is an open source search engine based on Lucene" doc2 : "Solr is an open source enterprise search engine based on Lucene" doc3 : "Solr is a popular open source enterprise search engine based on Lucene" When we search for documents similar to doc1, the result will contain: doc1 : score = 100 doc2 : score = 90 doc3 : score = 87 {quote} was (Author: caomanhdat): This new patch includes: - CosineHashFilter, which supports finding similar documents based on cosine similarity. - LSH#createSlowQuery + LSHSimilarity, which support finding similar documents under an arbitrary similarity measure and return similarity scores. {quote} doc1 : "Solr is an open source search engine based on Lucene" doc2 : "Solr is an open source enterprise search engine based on Lucene" doc3 : "Solr is a popular open source enterprise search engine based on Lucene" When we search for documents similar to doc1, the result will contain: doc1 : score = 100 doc2 : score = 90 doc3 : score = 87 {quote} > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH, which supports queries like this: > {quote} > Find similar documents that have a 0.8 or higher similarity score with a given > document. The similarity measurement can be cosine, Jaccard, Euclidean, etc. > {quote} > For example, given the following corpus: > {quote} > 1. Solr is an open source search engine based on Lucene > 2. 
Solr is an open source enterprise search engine based on Lucene > 3. Solr is a popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We want to find documents that have a Jaccard score of at least 0.6 with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4)
[jira] [Comment Edited] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091661#comment-15091661 ] Cao Manh Dat edited comment on LUCENE-6968 at 1/11/16 9:29 AM: --- This new patch includes: - Changed IntHash to MultiplyShiftHash (faster universal hashing). - CosineHashFilter, which supports finding similar documents based on cosine similarity. - LSH#createSlowQuery + LSHSimilarity, which support finding similar documents under an arbitrary similarity measure and return similarity scores. {quote} doc1 : "Solr is an open source search engine based on Lucene" doc2 : "Solr is an open source enterprise search engine based on Lucene" doc3 : "Solr is a popular open source enterprise search engine based on Lucene" When we search for documents similar to doc1, the result will contain: doc1 : score = 100 doc2 : score = 90 doc3 : score = 87 {quote} was (Author: caomanhdat): This new patch includes: - CosineHashFilter, which supports finding similar documents based on cosine similarity. - LSH#createSlowQuery + LSHSimilarity, which support finding similar documents under an arbitrary similarity measure and return similarity scores. {quote} doc1 : "Solr is an open source search engine based on Lucene" doc2 : "Solr is an open source enterprise search engine based on Lucene" doc3 : "Solr is a popular open source enterprise search engine based on Lucene" When we search for documents similar to doc1, the result will contain: doc1 : score = 100 doc2 : score = 90 doc3 : score = 87 {quote} > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH, which supports queries like this: > {quote} > Find similar documents that have a 0.8 or higher similarity score with a given > document. The similarity measurement can be cosine, Jaccard, Euclidean, etc. > {quote} > For example, given the following corpus: > {quote} > 1. 
Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is a popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We want to find documents that have a Jaccard score of at least 0.6 with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4)
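The MultiplyShiftHash mentioned in the comment above refers to multiply-shift universal hashing, where the high bits of an odd-multiplier product serve as the hash; a single multiply and shift replace the modular arithmetic of classic universal hashing. A minimal sketch (class shape and constants are assumptions, not the patch's code):

```java
import java.util.SplittableRandom;

// Sketch of multiply-shift universal hashing. With a random odd 64-bit
// multiplier a, h(x) = (a * x mod 2^64) >>> (64 - bits): the high output
// bits of the product form a 2-universal hash of x.
class MultiplyShiftDemo {
    private final long a;    // random odd multiplier
    private final int bits;  // output width in bits

    MultiplyShiftDemo(long seed, int bits) {
        this.a = new SplittableRandom(seed).nextLong() | 1L; // must be odd
        this.bits = bits;
    }

    int hash(long x) {
        return (int) ((a * x) >>> (64 - bits));
    }
}
```

The "faster" in the comment comes from avoiding division or modulo entirely: Java's `long` multiplication already wraps mod 2^64, so the whole hash is one multiply and one unsigned shift.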