[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262169#comment-15262169 ]
Andy Hind commented on LUCENE-6968:
-----------------------------------

I agree that a pure token stream test makes sense. My only concern is about testing token filters chained together. Chaining shingle generation with min hashing requires that the underlying token stream has its state reset correctly for reuse. Since I missed this initially, I added a test to cover it. Is there somewhere else in the test framework that covers this case? Some randomised chaining of filters? Perhaps chaining is more of a Solr thing.

I would prefer to stick with a 128/96 bit hash. The link below [1] "suggests" that 5-shingles become well distributed. Link [2] says up to 2/3 of all possible trigrams have been seen in 30 years of news articles. So it seems we can expect to see many of the possible 5-shingles. Some bioinformatics use cases may also require this.

{quote}
[1] http://googleresearch.blogspot.co.uk/2006/08/all-our-n-gram-are-belong-to-you.html
[2] http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
{quote}

I was not that keen to add Guava! However, it was already there somewhere. I am happy if this moves off into a separate module, and I will also look at how this dependency could be removed.

Perhaps we should take some time to consider how to include the fingerprint length (the sum of the min set sizes over all hashes) to support an unbiased query. An unbiased query would be more difficult to build correctly, so some fingerprint/LSH query support and tests may make sense. Some other statistics may also be useful for generating faster queries that find similar documents at a given similarity threshold with a given probability of meeting that threshold.

> LSH Filter
> ----------
>
>          Key: LUCENE-6968
>          URL: https://issues.apache.org/jira/browse/LUCENE-6968
>      Project: Lucene - Core
>   Issue Type: Improvement
>     Reporter: Cao Manh Dat
>     Assignee: Tommaso Teofili
>  Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a given document. The similarity measurement can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard similarity of 0.6 or higher with this document:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis would also return doc 4).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
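
A minimal sketch of the shingle-then-min-hash chain discussed in the comment above, reusing a single Analyzer across documents to exercise the reset-for-reuse behaviour. The MinHashFilter constructor arguments used here (hash count, bucket count, per-hash set size, rotation) are assumptions based on the attached patches and may not match them exactly:

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.minhash.MinHashFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public class MinHashChainSketch {

  // Analyzer that feeds 5-word shingles into the min hash filter.
  // The MinHashFilter arguments (1 hash, 512 buckets, set size 1, with
  // rotation) are assumed defaults and may differ from the patch.
  static Analyzer minHashAnalyzer() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        ShingleFilter shingles = new ShingleFilter(source, 5, 5);
        shingles.setOutputUnigrams(false);
        TokenStream hashed = new MinHashFilter(shingles, 1, 512, 1, true);
        return new TokenStreamComponents(source, hashed);
      }
    };
  }

  // Consume the same analyzer instance twice. The second document only
  // produces hashes if every filter in the chain resets its wrapped
  // stream correctly on reuse.
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = minHashAnalyzer();
    String[] docs = {
        "Solr is an open source search engine based on Lucene",
        "Solr is an open source enterprise search engine based on Lucene"
    };
    for (String doc : docs) {
      int hashes = 0;
      try (TokenStream ts = analyzer.tokenStream("field", doc)) {
        ts.reset();
        while (ts.incrementToken()) {
          hashes++;
        }
        ts.end();
      }
      System.out.println(doc + " -> " + hashes + " min hash token(s)");
    }
  }
}
{code}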