Re: Best fuzzy match on multiple terms

Tomoko Uchida Fri, 14 Jun 2019 00:33:12 -0700

Hi Matthias,

What similarity class are you using?
Just a guess, but one possible reason is document (field) length
normalization. Generally speaking, shorter documents get higher
scores than longer ones. (I have seen that the classic TF-IDF similarity
tends to give much higher scores to shorter documents. Newer versions
of Lucene use BM25 similarity by default, which moderates this tendency
and has a tuning parameter 'b' to control the normalization effect.)
See also: 
https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html
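
To illustrate the length normalization effect, here is a minimal sketch in plain Java (no Lucene dependency) of the textbook Okapi BM25 per-term score. The class name, method names, and the collection statistics are made up for illustration, and Lucene's own BM25Similarity differs in details (for example, it works with lossily encoded field lengths), so treat this as a way to see the tendency rather than a way to reproduce your exact scores. It compares a 2-token field (like "Acer campestre") with a 3-token field (like "Acer campestre 'Rozi'") for one matching term.

```java
public class Bm25LengthNormSketch {

    // Textbook BM25 inverse document frequency; the +1 inside the log keeps it non-negative.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Per-term BM25 score. 'b' controls how strongly field length affects the result:
    // b = 0 disables length normalization, b = 1 applies it fully.
    static double termScore(double tf, double fieldLen, double avgFieldLen,
                            double idf, double k1, double b) {
        double lengthNorm = 1.0 - b + b * (fieldLen / avgFieldLen);
        return idf * (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double idf = idf(1000, 50); // hypothetical: 1000 docs, term occurs in 50 of them
        double avgLen = 3.0;        // hypothetical average field length in tokens
        double k1 = 1.2;
        for (double b : new double[] {0.75, 0.0}) {
            double twoTokens = termScore(1, 2, avgLen, idf, k1, b);
            double threeTokens = termScore(1, 3, avgLen, idf, k1, b);
            System.out.printf("b=%.2f  2-token field: %.4f  3-token field: %.4f%n",
                    b, twoTokens, threeTokens);
        }
    }
}
```

With b = 0.75 the shorter field scores higher for the same term frequency, and with b = 0 both fields get the same score for that term.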


As Namgyu Kim said, the explain() API can help you examine the details.

Tomoko

On Fri, Jun 14, 2019 at 1:27, Namgyu Kim <[email protected]> wrote:
>
> Dear Matthias,
>
> First, you need to understand Lucene's ranking concept.
> Lucene's default ranking is BM25, and the scores depend on the state of your index.
> (https://en.wikipedia.org/wiki/Okapi_BM25)
> There can be many reasons.
> One thing I can guess is that the term 'rozi' occurs in many documents in your
> index, so a match on it contributes little to the score.
> This effect comes from IDF (Inverse Document Frequency).
> Anyway, if you want fine-grained control over scoring, you need to understand the
> BM25 formula.
>
> Lucene can also tell you how a score was computed.
> You can use Explanation to get it.
> I attach sample code.
> ======================================
> IndexSearcher searcher = new IndexSearcher(reader);
> TopDocs docs = searcher.search(q, hitsPerPage);
> ScoreDoc[] hits = docs.scoreDocs;
>
> for (int i = 0; i < hits.length; ++i) {
>   int docId = hits[i].doc;
>   Explanation explanation = searcher.explain(q, docId);
>   // You can see how the score is calculated
>   System.out.println("Explanation : " + explanation.toString());
> }
> ======================================
>
> I hope it helps :D
>
> Best regards,
> Namgyu Kim
>
> P.S. For BM25, the default value in Lucene is k1 = 1.2, b = 0.75.
>
> On Fri, Jun 14, 2019 at 12:54 AM, <[email protected]> wrote:
>
> > I would suggest trying (indexing and searching) without the ' characters and
> > seeing whether you can find it first.
> >
> > Thanks
> >
> >
> > On 6/13/19 11:25 AM, Matthias Müller wrote:
> > > I am currently matching botanic names (with possible misspellings)
> > > against an indexed reference list with Lucene. After quick progress in
> > > the beginning, I am struggling with the proper query design to achieve
> > > the ranking I want.
> > >
> > > Here is an example:
> > >
> > > Search term: Acer campestre 'Rozi'
> > >
> > > Tokenized (decomposed) representation:
> > > acer
> > > campestre
> > > rozi
> > >
> > > Top 10 hits:
> > > {value=Acer campestre, score=12.288989}
> > > {value=Acer campestre 'Rozi', score=11.955223} // <- why is it 2nd?
> > > {value=Acer campestre 'Arends', score=10.640412}
> > > {value=Acer campestre subsp. leiocarpon, score=10.640412}
> > > {value=Acer campestre 'Carnival', score=10.640412}
> > > {value=Acer campestre 'Commodore', score=10.640412}
> > > {value=Acer campestre 'Nanum', score=10.640412}
> > > {value=Acer campestre 'Elsrijk', score=10.640412}
> > > {value=Acer campestre 'Fastigiatum', score=10.640412}
> > > {value=Acer campestre 'Geessink', score=10.640412}]
> > >
> > >
> > > And here is how I create my queries:
> > >
> > > final BooleanQuery.Builder builder = new BooleanQuery.Builder();
> > >    // add individual tokens to query
> > >    for (String token : fuzzyTokens) {
> > >      final Term term = new Term(NAME_TOKENS.name(), token);
> > >      final FuzzyQuery fq = new FuzzyQuery(term);
> > >      builder.add(fq, BooleanClause.Occur.SHOULD);
> > >    }
> > >    return builder.build();
> > > }
> > >
> > >
> > > Input names are analyzed with a StandardTokenizer and Lowercase filter
> > > when they are added to the IndexWriter.
> > >
> > >
> > > My question: How can I get a ranking that scores
> > > "Acer campestre 'Rozi'" higher than "Acer campestre"?
> > > I am sure there is an obvious way to achieve this that I have so far
> > > failed to find.
> > >
> > >
> > > -Matthias
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> >
> >
> >
> >

