[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-06-13 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328054#comment-15328054
 ] 

Andy Hind edited comment on LUCENE-6968 at 6/13/16 7:55 PM:


Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter 
that was failing intermittently, I think on one of the evil test cases.

LongPair should implement equals (and probably hashCode, if it will be reused) 
since it goes into a TreeSet. An oversight on my part.

FWIW, as far as I can tell, the change in patch 6 was included in 5.


was (Author: andyhind):
Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter 
that was failing intermittently. I think on one of the evil test cases.

LongPair should implement equals (and probably hashCode if it will be reused) 
as it goes into a TreeSet. An over sight on my part.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a 
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We want to find documents that have a 0.6 Jaccard score with this doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis would also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-05-06 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263867#comment-15263867
 ] 

Andy Hind edited comment on LUCENE-6968 at 5/6/16 8:43 PM:
---

After a bit more digging, the single-hash approach of keeping the minimum set 
can be improved.

See: 
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keeping the minimum set, split the hash space into 500 
buckets (for a 500-hash fingerprint) and keep the minimum for each bucket. To 
fill an empty bucket, take the minimum from the next non-empty bucket to the 
right, with rotation. 
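A rough sketch of this scheme (sometimes called one-permutation hashing with 
densification) is below; the bucket count and hash width are illustrative 
parameters, not values taken from the patch:

```python
def bucketed_minhash(hashes, num_buckets=500, hash_bits=64):
    """Split the hash space into num_buckets equal ranges and keep the
    minimum hash value seen in each range (one bucket per range)."""
    bucket_width = (1 << hash_bits) // num_buckets
    buckets = [None] * num_buckets
    for h in hashes:
        i = min(h // bucket_width, num_buckets - 1)
        if buckets[i] is None or h < buckets[i]:
            buckets[i] = h
    # Densification: fill each empty bucket with the minimum from the
    # next non-empty bucket to the right, rotating past the end.
    filled = list(buckets)
    for i in range(num_buckets):
        if filled[i] is None:
            j = (i + 1) % num_buckets
            while buckets[j] is None:
                j = (j + 1) % num_buckets
            filled[i] = buckets[j]
    return filled
```

With a single hash per shingle this still yields a fixed-length fingerprint, 
even when a document's distinct hashes cluster in only a few ranges.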


was (Author: andyhind):
After a bit more digging, the single hash and keeping the minimum set can be 
improved.

See: 
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keep the minimum set, split the hash space up into 500 
buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To 
fill an empty bucket, take the minimum from the next non-empty bucket on the 
right adding an offset for each step taken. 




[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-28 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259924#comment-15259924
 ] 

Andy Hind edited comment on LUCENE-6968 at 4/28/16 9:27 AM:


This comes down to "what is a good estimate of |A U B|" and whether we need it 
for the use case.

For query, the main use case is finding documents like one source document. So 
we are comparing doc A with all other documents. What we need is a measure that 
is fair for A -> B and A -> C; we probably do not care about B -> C. If we take 
the fingerprint from A and just OR the bits together into a big query, we have 
a consistent measure of the similarity of A with any other document. This 
particular measure is biased: for a start, Sim(A, B) is not equal to Sim(B, A). 
But for this use case that may not matter. This measure captures both inclusion 
and duplication, which may be a good thing, and it is also pretty intuitive 
what it means. This is |A ∩ B|/|A|.

If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of 
|A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - 
|A ∩ B|), or some similar estimate. We could use the size of the fingerprint 
sets. We could keep the full ordered set of hashes and have extra statistics 
like the total number of hashes and total number of unique hashes. 

For two short documents, where there are fewer fingerprints than the maximum, 
we have the full sets.
For two larger docs we have an estimate of these based on the min hash sets. 
You can argue that "min of many hashes" is a random sample with replacement and 
"min set of one hash" is a random sample without replacement, if the hash is 
good. If the sample is small compared with the set of all hashes, the arguments 
converge. 

If we were doing arbitrary comparisons between any pair of documents then we 
would have to use an unbiased estimator: finding candidate pairs, moving on to 
clustering, ...
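The two sampling views above ("min of many hashes" vs "min set of one hash") 
can be sketched as follows; the salted use of Python's built-in hash is purely 
illustrative, not the hashing used in the patch:

```python
import random

def minhash_many(items, k, seed=0):
    """'Min of many hashes': one minimum per independent hash function
    (a sample of the set with replacement)."""
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def bottom_k(items, k, salt=12345):
    """'Min set of one hash': the k smallest values under a single hash
    (a sample of the set without replacement)."""
    return set(sorted(hash((salt, x)) for x in items)[:k])

def estimate_many(sig_a, sig_b):
    # Fraction of hash functions whose minima agree.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def estimate_bottom_k(sk_a, sk_b, k):
    # Bottom-k of the union, intersected with both sketches.
    union_bottom = set(sorted(sk_a | sk_b)[:k])
    return len(union_bottom & sk_a & sk_b) / k
```

For identical sets both estimators return 1.0; when the sample is small 
relative to the sets, the two estimates agree closely, matching the 
convergence argument above.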



was (Author: andyhind):
This comes down to "what is a good estimate of |A U B|" and do we need it for 
the use case.

For query, the main use case is finding documents like one source document. So 
we are comparing Doc A with all other documents. What we need is a measure that 
is fair for A -> B, A -> C. We probably do not care about B -> C. If we take 
the fingerprint from A and just OR the bits together into a big query we have a 
consistent measure of similarity of A with any other document. This particular 
measure is biased. For a start Sim(A, B) is not equal to Sim(B, A). But for 
this use case that may not matter. This measure contains both inclusion and 
duplication which may be a good thing. It is also pretty intuitive what it 
means. This is (A ∩ B)/|A|.

If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of 
|A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - 
|A ∩ B|), or some similar estimate. We could use the size of the fingerprint 
set. We could keep the full ordered set of hashes and have extra statistics 
like the total number of hashes and total number of unique hashes. 

For two short documents, where there are fewer fingerprints than the maximim, 
we have the full set.
For two larger docs we have an estimate of these based in the min hash sets.  
You can argue "min of many hashes"  is a random sample with replacement and 
"min set of one hash" is a random sample with out replacement (min set); if 
your hash is good. If the sample is small compared with the set of all hashes 
the arguments converge. 

If we were doing arbitrary comparisons between any pair of documents then we 
would have to use an unbiased estimator. Finding candidate pairs, moving onto 
clustering, ...



[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-26 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257732#comment-15257732
 ] 

Cao Manh Dat edited comment on LUCENE-6968 at 4/26/16 8:29 AM:
---

Thanks for the link. I totally agree that keeping the lowest values for a 
single hash function would be better. 

But the wiki doc points out that the estimator for the "variant with a single 
hash function" is not the same as the estimator for the "variant with many hash 
functions", so the generated query must be different in each case.

For example, in the case where we use a single hash function and keep the k 
lowest values:
- We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
- So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k = |{5}| / k = 0.2 (with 
k = 5)
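With the identity function standing in for the hash (purely for illustration), 
the calculation above can be checked mechanically, here with k = 5:

```python
def bottom_k_sketch(values, k):
    """h_k(S): the k smallest hash values of S (identity hash here)."""
    return set(sorted(values)[:k])

def single_hash_jaccard(a, b, k):
    """Single-hash-variant estimator: |h_k(A U B) ∩ h_k(A) ∩ h_k(B)| / k."""
    union_sketch = bottom_k_sketch(a | b, k)
    return len(union_sketch & bottom_k_sketch(a, k) & bottom_k_sketch(b, k)) / k

A = {1, 2, 5, 6, 7}
B = {3, 4, 5, 6, 7}
estimate = single_hash_jaccard(A, B, 5)   # 0.2, as computed above
true_jaccard = len(A & B) / len(A | B)    # 3/7, about 0.43
```

The gap between 0.2 and 3/7 is the point being made: the two variants need 
differently constructed queries.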


was (Author: caomanhdat):
Thanks for the link. I totally agree that keeping some lowest values for single 
hash function would be better. 

But in the wiki doc. It pointed out that the estimator formulation for "variant 
with a single hash function" is not same as the estimator formulation for 
"variant with many hash function". So the generated query must be different for 
each case.

For example, in case we use single hash function and keep some lowest values :
- We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
- So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k 
=  | { 5 } | / k = 0.2




[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-26 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257732#comment-15257732
 ] 

Cao Manh Dat edited comment on LUCENE-6968 at 4/26/16 8:28 AM:
---

Thanks for the link. I totally agree that keeping the lowest values for a 
single hash function would be better. 

But the wiki doc points out that the estimator for the "variant with a single 
hash function" is not the same as the estimator for the "variant with many hash 
functions", so the generated query must be different in each case.

For example, in the case where we use a single hash function and keep the k 
lowest values:
- We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
- So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k 
= |{5}| / k = 0.2


was (Author: caomanhdat):
Thanks for the link. I totally agree that keeping some lowest values for single 
hash function would be better. 

But in the wiki doc. It pointed out that the estimator formulation for "variant 
with a single hash function" is not same as the estimator formulation for 
"variant with many hash function". So the generated query must be different for 
each case.

For example, in case we use single hash function and keep some lowest values :
- We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
- So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k =  | { 5 } | / k = 0.2




[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-26 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257732#comment-15257732
 ] 

Cao Manh Dat edited comment on LUCENE-6968 at 4/26/16 8:26 AM:
---

Thanks for the link. I totally agree that keeping the lowest values for a 
single hash function would be better. 

But the wiki doc points out that the estimator for the "variant with a single 
hash function" is not the same as the estimator for the "variant with many hash 
functions", so the generated query must be different in each case.

For example, in the case where we use a single hash function and keep the k 
lowest values:
- We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
- So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k = |{5}| / k = 0.2


was (Author: caomanhdat):
Thanks for the link. I totally agree that keeping some lowest values for single 
hash function would be better. 

But in the wiki doc. It pointed out that the estimator formulation for "variant 
with a single hash function" is not same as the estimator formulation for 
"variant with many hash function". So the generated query must be different for 
each case.

For example, in case we use single hash function and keep some lowest values :
1. We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
3. So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k = |{5}| / k = 0.2




[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-25 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256114#comment-15256114
 ] 

Andy Hind edited comment on LUCENE-6968 at 4/25/16 11:52 AM:
-

The argument here says it is pretty much the same.

https://en.wikipedia.org/wiki/MinHash

The plan was to offer both options.

With respect to banding and finding docs related to some start document, the 
number of hashes may depend on the start document. 

Let's start with 5-word shingles, one hash, and keeping the minimum 100 hash 
values. For a five-word document we get one hash. For a 100-word doc where all 
the shingles/words are the same we get one hash. For all-different shingles we 
get 96 hashes.

If we have 100 different hashes and keep the lowest value of each, all the 
above cases end up with 100 hashes.

So back to banding. With minimum sets, you need to look and see how many hashes 
you really got and then do the banding. Comparing a small document/snippet 
(where we get 10 hashes in the fingerprint) with a much larger document (where 
we get 100 hashes) is an interesting case to consider. Starting with the small 
document there are fewer bits to match in the generated query. With 100 hashes 
from the small document I think you end up in roughly the same place, except 
for small snippets: any given band is more likely to have the same shingle 
hashed different ways.

With a 100-hash fingerprint, sampling is great for 100 words but not so great 
for 100,000 words. With a minimum set we have the option to generate a 
fingerprint related to the document length and other features.

There is also an argument for a winnowing approach. 




was (Author: andyhind):
The argument here says it is pretty much the same.

{code}
https://en.wikipedia.org/wiki/MinHash
{code}

The plan was to offer both options.

With respect to banding and finding docs related to some start document, the 
number of hashes may depend on the start document. 

Let's start with 5 word shingles, one hash and keep the minimum 100 hash 
values. For a five word document we get one hash. For a 100 word doc where all 
the shingles/words are the same we get one hash. For all different shingles we 
get 96 hashes.

If we have 100 different hashes and keep the lowest one all the above cases end 
up with 100 hashes.

So back to banding. With minimum sets, you need to look and see how many hashes 
you really got and then do the banding. Comparing a small documents/snippet 
(where we get 10 hashes in the fingerprint)with a much larger document (where 
we get 100 hashes) is an interesting case to consider. Starting with the small 
document there are fewer bits to match in the generated query. With 100 hashes 
from the small document I think you end up in the roughly same place, except 
for small snippets. Any given band is more likely to have the same shingle 
hashed different ways.

There is also an argument for a winnowing approach. With a 100 hash 
fingerprint, sampling for 100 words is great but not so great for 100,000 
words. With a minimum set we have the option to generate a finger print related 
to the document length and other features.







[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-24 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255434#comment-15255434
 ] 

Cao Manh Dat edited comment on LUCENE-6968 at 4/25/16 4:52 AM:
---

What a wonderful patch! The code is optimized, so the index will surely be much 
smaller!

But the patch keeps the lowest values for each position; does that affect the 
formula
{code} Pr(h(s1) = h(s2)) = Jaccard(s1,s2) {code}
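This collision property can be checked empirically with a family of salted 
hash functions; the salting via Python's built-in hash is illustrative only:

```python
import random

def min_hash(items, salt):
    # Minimum of one salted hash function over the set.
    return min(hash((salt, x)) for x in items)

def collision_rate(s1, s2, trials=2000, seed=42):
    """Estimate Pr(h(s1) = h(s2)) over many random hash functions."""
    rnd = random.Random(seed)
    hits = 0
    for _ in range(trials):
        salt = rnd.getrandbits(64)
        hits += min_hash(s1, salt) == min_hash(s2, salt)
    return hits / trials

s1 = set(range(8))       # {0, ..., 7}
s2 = set(range(4, 12))   # {4, ..., 11}; Jaccard(s1, s2) = 4/12
```

collision_rate(s1, s2) should come out near 1/3, matching Jaccard(s1, s2). 
Keeping only some lowest values per position changes this estimator, which is 
exactly the concern raised in the comment.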


was (Author: caomanhdat):
What's a wonderful patch. The code is optimized, sure that the the index will 
be much smaller!

But the patch keep some lowest values for each position, so for given 
expectedTruePositive how can we compute the band size?




[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-24 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255434#comment-15255434
 ] 

Cao Manh Dat edited comment on LUCENE-6968 at 4/25/16 2:14 AM:
---

What a wonderful patch! The code is optimized, so the index will surely be much 
smaller!

But the patch keeps the lowest values for each position, so for a given 
expectedTruePositive how can we compute the band size?


was (Author: caomanhdat):
What's a wonderful patch. The code is optimized, sure that the the index will 
be much smaller!

But there are some line that's quite hard to understand for me.
- Is combineOrdered(HasCode... ) necessary?
- The patch keep some lowest values for each position, so for given 
expectedTruePositive how can we compute the band size?




[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-21 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252743#comment-15252743
 ] 

Andy Hind edited comment on LUCENE-6968 at 4/21/16 11:06 PM:
-

Hi

It would be quite common to use min hashing after shingling, and at that point 
the number of possible word combinations vs the size of the hash is important. 
With shingles of 5 words drawn from a 100,000-word vocabulary, that is 1e25 
combinations. Some naive processing of the ~500k Enron emails (splitting on 
white space, case folding and 5-word shingles) gives ~52M combinations. So a 
long hash, with ~1.8e19 possible values, would be better. I have not yet looked 
at a larger corpus.

The LSH query is neat. However, the logic can give banding where the last band 
is uneven: in the patch I think the last band would be dropped unless bands * 
rows per band = # of hashes.
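A banding split that keeps an uneven last band rather than dropping it could 
look like this (an illustrative sketch, not the patch's code):

```python
def to_bands(hashes, rows_per_band):
    """Split a fingerprint into bands of rows_per_band hashes each;
    the final band keeps any remainder instead of being dropped."""
    return [tuple(hashes[i:i + rows_per_band])
            for i in range(0, len(hashes), rows_per_band)]
```

For a 10-hash fingerprint with 4 rows per band this gives three bands, the 
last holding the 2 leftover hashes.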

The underlying state of the source filter may also be lost (if using shingling).

I do not believe the similarity is required at all. I think you can get Jaccard 
distance using constant score queries and disabling coordination on the boolean 
query. 
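One reading of this: with every fingerprint hash as a constant-score clause and 
coordination disabled, a candidate's score reduces to its clause-match count, 
which, normalised by the query fingerprint size, approximates |A ∩ B| / |A|. A 
sketch of that scoring (names are illustrative, not Lucene API):

```python
def fingerprint_overlap_score(query_hashes, doc_hashes):
    """Each matching hash contributes a constant score of 1; normalising
    by the query fingerprint size gives a containment-style similarity."""
    matches = sum(1 for h in query_hashes if h in doc_hashes)
    return matches / len(query_hashes)
```

So a custom Similarity is unnecessary for this ranking: the boolean query's 
matched-clause count already carries the signal.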

I went for 128-bit hashes, or a 32-bit hash identifier + 96-bit hash, with a 
bit more flexibility allowing a minimum set of hash values for a bunch of 
hashes. There is clearly some trade-off between speed of hashing and 
over-representing short documents. The minimum set may be a solution to this. I 
think there is some interesting research there. 

I will add my patch, inspired by the original, and apologise in advance for the 
mixed formatting ..



was (Author: andyhind):
Hi

It would be quite common to use min hashing after shingling. At this point the 
number of possible word combinations vs the size of the hash is important. With 
shingles of 5 words from 100,000 that is 10e25 combinations. Some naive 
processing of the ~500k  Enron emails (splitting on white space, case folding 
and 5 word shingles) gives ~1e13  combinations. So a long hash would be better 
at 1.8e19. I have not yet looked at a larger corpus.

The LSH query is neat. However the logic can give banding where the last band 
is uneven. In the patch I think the last band would be dropped unless bands * 
rows in band  = # of hashes

The underlying state of the source filter may also be lost (if using shingling)

I do not believe the similarity is required at all. I think you can get Jaccard 
distance using constant score queries and disabling coordination on the boolean 
query. 

I went for 128-bit hashes, or a 32 bit hash identifier + 96 bit hash with a bit 
more flexibility allowing a minimum set of hash values for a bunch of hashes. 
There is clearly some trade off for speed of hashing and over representing 
short documents. The minimum set may be a solution to this.  I think there is 
some interesting research there. 

I will add my patch inspired by the original  and apologise for the mixed 
formatting in advance ..





[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-01-11 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093091#comment-15093091
 ] 

Cao Manh Dat edited comment on LUCENE-6968 at 1/12/16 1:36 AM:
---

Yes, it is kind of like finding the K nearest neighbors, but there is a 
difference here:
- In K nearest neighbors, we use K as the parameter, which decides how many 
nearest neighbors we want to retrieve.
- In LSH we use a radius as the parameter. The radius defines the minimum 
distance between the center doc and the other docs. (LSH is also far faster 
than K nearest neighbors.)

In both cases, the results will be ranked by distance to the center doc.


was (Author: caomanhdat):
Yes, It kinda like finding K nearest neighbor. But there a different here:
- In K nearest neighbor, we use K as parameter which decide how many nearest 
neighbor we wanna retrieve.
- In LSH we use a radius as parameter. The radius define minimum distance 
between center doc and other docs. 
In both case, the result will be rank by distance to the center doc.







[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-01-11 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091661#comment-15091661
 ] 

Cao Manh Dat edited comment on LUCENE-6968 at 1/11/16 9:25 AM:
---

This new patch includes:
- CosineHashFilter, which supports finding similar documents based on cosine 
similarity.
- LSH#createSlowQuery + LSHSimilarity, which support finding similar documents 
under an arbitrary similarity measure, with similarity scores in the results.
{quote}
doc1 : "Solr is an open source search engine based on Lucene"
doc2 : "Solr is an open source enterprise search engine based on Lucene"
doc3 : "Solr is an open source enterprise search engine based on Lucene"

When we search for documents similar to doc1, the result will contain:

doc1 : score = 100
doc2 : score = 90
doc3 : score = 87
{quote}
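As a rough illustration of the cosine measure that CosineHashFilter is based on (this plain term-frequency version is a sketch, not the patch's implementation):

```java
import java.util.*;

public class CosineSketch {
    // Term-frequency map for a whitespace-tokenized, lowercased string.
    static Map<String, Integer> tf(String text) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            m.merge(t, 1, Integer::sum);
        }
        return m;
    }

    // Cosine similarity = dot(a, b) / (|a| * |b|) over term-frequency vectors.
    static double cosine(String a, String b) {
        Map<String, Integer> va = tf(a), vb = tf(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
        }
        double na = Math.sqrt(va.values().stream().mapToDouble(v -> v * v).sum());
        double nb = Math.sqrt(vb.values().stream().mapToDouble(v -> v * v).sum());
        return dot / (na * nb);
    }

    public static void main(String[] args) {
        // Identical texts score 1.0; disjoint texts score 0.0.
        System.out.println(cosine("solr lucene search", "solr lucene search"));
        System.out.println(cosine("solr", "lucene"));
    }
}
```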


was (Author: caomanhdat):
This new patch include :
- CosineHashFilter support finding similar documents based on cosine similarity.
- LSH#createSlowQuery + LSHSimilarity which support finding similar documents 
on arbitrary similarity and similar score results.
{quote}
doc1 : "Solr is an open source search engine based on Lucene"
doc2 : "Solr is an open source enterprise search engine based on Lucene"
doc3 : "Solr is an open source enterprise search engine based on Lucene"

When we search for similarity of doc1. The result will contain:
{quote}
doc1 : score = 100
doc2 : score = 90
doc3 : score = 87
{quote}
{quote}







[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-01-11 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091661#comment-15091661
 ] 

Cao Manh Dat edited comment on LUCENE-6968 at 1/11/16 9:24 AM:
---

This new patch includes:
- CosineHashFilter, which supports finding similar documents based on cosine 
similarity.
- LSH#createSlowQuery + LSHSimilarity, which support finding similar documents 
under an arbitrary similarity measure, with similarity scores in the results.
{quote}
doc1 : "Solr is an open source search engine based on Lucene"
doc2 : "Solr is an open source enterprise search engine based on Lucene"
doc3 : "Solr is an open source enterprise search engine based on Lucene"

When we search for documents similar to doc1, the result will contain:

doc1 : score = 100
doc2 : score = 90
doc3 : score = 87
{quote}


was (Author: caomanhdat):
This new patch include :
- CosineHashFilter support finding similar documents based on cosine similarity.
- LSH#createSlowQuery + LSHSimilarity which support finding similar documents 
on arbitrary similarity and similar score results.
{quote}
doc1 : "Solr is an open source search engine based on Lucene"
doc2 : "Solr is an open source enterprise search engine based on Lucene"
doc3 : "Solr is an open source enterprise search engine based on Lucene"
When we search for similarity of doc1. The result will contain:
doc1 : score = 100
doc2 : score = 90
doc3 : score = 87
{quote}







[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-01-11 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091661#comment-15091661
 ] 

Cao Manh Dat edited comment on LUCENE-6968 at 1/11/16 9:29 AM:
---

This new patch includes:
- Change IntHash to MultiplyShiftHash (faster universal hashing).
- CosineHashFilter, which supports finding similar documents based on cosine 
similarity.
- LSH#createSlowQuery + LSHSimilarity, which support finding similar documents 
under an arbitrary similarity measure, with similarity scores in the results.
{quote}
doc1 : "Solr is an open source search engine based on Lucene"
doc2 : "Solr is an open source enterprise search engine based on Lucene"
doc3 : "Solr is an open source enterprise search engine based on Lucene"

When we search for documents similar to doc1, the result will contain:

doc1 : score = 100
doc2 : score = 90
doc3 : score = 87
{quote}
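For background, multiply-shift hashing needs only one multiplication and one shift per value, which is why it is one of the fastest universal hashing schemes. A sketch of the general scheme (not the patch's actual MultiplyShiftHash class):

```java
import java.util.Random;

public class MultiplyShiftSketch {
    // Multiply-shift universal hashing: h_a(x) = (a * x) >>> (64 - bits),
    // where a is a random odd 64-bit multiplier. The high bits of the
    // product become the hash bucket.
    static int hash(long a, long x, int bits) {
        return (int) ((a * x) >>> (64 - bits));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        long a = rnd.nextLong() | 1L; // force the multiplier to be odd
        for (long x = 0; x < 4; x++) {
            System.out.println(hash(a, x, 16)); // one 16-bit bucket per key
        }
    }
}
```

Drawing several independent multipliers gives the family of hash functions that MinHash-style signatures require.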


was (Author: caomanhdat):
This new patch include :
- CosineHashFilter support finding similar documents based on cosine similarity.
- LSH#createSlowQuery + LSHSimilarity which support finding similar documents 
on arbitrary similarity and similar score results.
{quote}
doc1 : "Solr is an open source search engine based on Lucene"
doc2 : "Solr is an open source enterprise search engine based on Lucene"
doc3 : "Solr is an open source enterprise search engine based on Lucene"

When we search for similarity of doc1. The result will contain:

doc1 : score = 100
doc2 : score = 90
doc3 : score = 87
{quote}



