[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-6968:
------------------------------------
    Attachment: LUCENE-6968.4.patch

Attaching a slightly modified version of the last patch:
- added service loader binding for MinHashFilterFactory
- added the dependencies required by IntelliJ
- minor fixes to javadoc (and to code style, for consistency with the rest of the codebase)

I've noticed, though, that the filter doesn't set the end offset attribute 
correctly (it ends up beyond the input length). In fact, if I run all the 
tests, {{TestFactories}} fails with the following:
{noformat}
Suite: org.apache.lucene.analysis.core.TestFactories
   [junit4]   2> TEST FAIL: useCharFilter=true text='uuzfmo'
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestFactories -Dtests.method=test -Dtests.seed=9CF9D39BDAB31A80 -Dtests.slow=true -Dtests.locale=sv -Dtests.timezone=Asia/Choibalsan -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 13.5s J3 | TestFactories.test <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: endOffset must be <= finalOffset: got endOffset=7 vs finalOffset=6
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([9CF9D39BDAB31A80:14ADEC41744F7778]:0)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:211)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:300)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:304)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:828)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:627)
   [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:525)
   [junit4]    >        at org.apache.lucene.analysis.core.TestFactories.doTestTokenFilter(TestFactories.java:105)
   [junit4]    >        at org.apache.lucene.analysis.core.TestFactories.test(TestFactories.java:58)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=Lucene60, sim=RandomSimilarity(queryNorm=true,coord=no): {}, locale=sv, timezone=Asia/Choibalsan
   [junit4]   2> NOTE: Mac OS X 10.11.3 x86_64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=197816752,total=324534272
   [junit4]   2> NOTE: All tests run in this JVM: [TestPatternCaptureGroupTokenFilter, TestSnowballPorterFilterFactory, TestBulgarianStemFilterFactory, TestFactories]
{noformat}
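
To make the failing invariant concrete, here is a minimal plain-Java sketch of the check the assertion message describes, with the values 7 and 6 taken from the log above. The class and method names are hypothetical and it does not use the Lucene API. Clamping is only one conceivable remedy, shown as an assumption; it is not necessarily how the patch should be fixed.

```java
// Hypothetical illustration only: names are invented, not Lucene classes.
final class OffsetInvariantSketch {

    /** Mirrors the failing check: a token's end offset may not exceed the final offset. */
    static void checkEndOffset(int endOffset, int finalOffset) {
        if (endOffset > finalOffset) {
            throw new AssertionError("endOffset must be <= finalOffset: got endOffset="
                + endOffset + " vs finalOffset=" + finalOffset);
        }
    }

    /** Assumed remedy: never report an end offset past the final offset. */
    static int clampEndOffset(int endOffset, int finalOffset) {
        return Math.min(endOffset, finalOffset);
    }

    public static void main(String[] args) {
        int finalOffset = 6;    // finalOffset reported in the log
        int reportedEnd = 7;    // endOffset reported in the log

        // The clamped value satisfies the invariant.
        checkEndOffset(clampEndOffset(reportedEnd, finalOffset), finalOffset);

        // The unclamped value reproduces the message from the test failure.
        try {
            checkEndOffset(reportedEnd, finalOffset);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```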

> LSH Filter
> ----------
>
>                 Key: LUCENE-6968
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6968
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>         Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find documents that have a similarity score of 0.8 or higher with a given 
> document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We want to find documents that have a Jaccard score of 0.6 with this 
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).
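
For reference, the exact Jaccard measure mentioned in the description can be sketched as below. This assumes lowercased, whitespace-separated word sets, which the issue does not specify. An LSH/MinHash filter would only approximate this value, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the patch: exact Jaccard over word sets.
final class JaccardSketch {

    /** Assumed tokenization: lowercase, split on whitespace, keep distinct words. */
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    /** Jaccard similarity: |A intersect B| / |A union B|. */
    static double jaccard(String a, String b) {
        Set<String> sa = tokens(a);
        Set<String> sb = tokens(b);
        Set<String> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String query = "Solr is an open source search engine";
        String[] corpus = {
            "Solr is an open source search engine based on Lucene",
            "Solr is an open source enterprise search engine based on Lucene",
            "Solr is an popular open source enterprise search engine based on Lucene",
            "Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java"
        };
        for (int i = 0; i < corpus.length; i++) {
            System.out.println("doc " + (i + 1) + ": " + jaccard(query, corpus[i]));
        }
    }
}
```

Under this particular tokenization the scores are 7/10, 7/11, 7/12 and 3/18 for docs 1 to 4. Doc 4 is clearly separated from the rest, while doc 3 sits just under 0.6, so which documents clear a given threshold depends on the tokenization or shingling actually used.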


