Re: Best fuzzy match on multiple terms

Matthias Müller Fri, 14 Jun 2019 03:46:44 -0700

Hi Namgyu and Tomoko,

Your hint about Explanation was very helpful; I was not aware of
this feature.


I have now experimented with different scoring functions and it seems
that DFISimilarity and BM25Similarity (with lower 'b') produce results
in the direction I prefer, though not perfect for some cases [1].
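
To make the effect of 'b' concrete, here is a minimal sketch of the
classic Okapi BM25 term formula in plain Java. It is not Lucene's
actual implementation (Lucene differs in details such as how field
lengths are stored); the class name, the idf value and the document
lengths are made up for illustration. In Lucene itself, the same
parameters would be passed as new BM25Similarity(k1, b) to
IndexSearcher.setSimilarity().

```java
final class Bm25LengthNormSketch {

    /**
     * Classic Okapi BM25 contribution of one query term.
     * The idf is passed in so that only the length normalization varies.
     */
    static double termScore(double idf, double tf, double docLen,
                            double avgDocLen, double k1, double b) {
        // b = 0 disables length normalization, b = 1 applies it fully
        double norm = 1.0 - b + b * (docLen / avgDocLen);
        return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm);
    }

    public static void main(String[] args) {
        // Hypothetical values: one occurrence of the term, average length 3 tokens
        double idf = 1.0, tf = 1.0, avgLen = 3.0, k1 = 1.2;
        for (double b : new double[] {0.75, 0.3, 0.0}) {
            double shortDoc = termScore(idf, tf, 2.0, avgLen, k1, b);
            double longDoc = termScore(idf, tf, 5.0, avgLen, k1, b);
            System.out.printf("b=%.2f short=%.4f long=%.4f ratio=%.4f%n",
                    b, shortDoc, longDoc, shortDoc / longDoc);
        }
    }
}
```

With a lower 'b' the advantage of the shorter document shrinks, and
with b = 0 both documents get the same score for the term.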

The fuzzy term queries probably generate similarities on additional
fields that are hard to predict. These add to the overall score and
also affect normalization.
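
As a rough illustration of why fuzzy matches are hard to predict,
the sketch below lists which terms of a small vocabulary fall within
two edits of a query token (two is FuzzyQuery's default maximum).
It uses a plain Levenshtein distance, whereas Lucene works with
Levenshtein automata and by default counts a transposition as a
single edit, so this is a simplification. The class name and the
vocabulary are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class FuzzyExpansionSketch {

    /** Plain Levenshtein distance (insert, delete, substitute). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the vocabulary terms within maxEdits of the token. */
    static List<String> withinEdits(String token, List<String> vocabulary, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : vocabulary) {
            if (levenshtein(token, term) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical index terms, not taken from the real reference list
        List<String> vocabulary = List.of("rozi", "rosi", "rosa", "rosea", "nanum");
        System.out.println(withinEdits("rozi", vocabulary, 2));
    }
}
```

Short tokens in particular have many neighbours within two edits, and
every such neighbour can contribute to the score of some document.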

On the positive side, the preferred matches are somewhere in the top
ranks. So maybe a rule-based assessment of the top N hits might help
me achieve what I want.
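
A minimal sketch of what such a rule-based pass could look like,
assuming the top N hits have already been copied out of Lucene into a
plain Hit class (Hit, TopHitsReranker and the tokenization regex are
hypothetical and only approximate StandardTokenizer plus lowercasing).
The rule is: fewer missing or extra tokens first, then the original
score as a tie-breaker.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TopHitsReranker {

    /** Hypothetical holder for one search hit (stored name and Lucene score). */
    static final class Hit {
        final String value;
        final float score;

        Hit(String value, float score) {
            this.value = value;
            this.score = score;
        }
    }

    /** Lowercases and splits on anything that is not a letter or digit. */
    static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Number of query tokens missing from the candidate plus extra candidate tokens. */
    static int mismatch(String query, String candidate) {
        Set<String> q = tokens(query);
        Set<String> c = tokens(candidate);
        Set<String> missing = new HashSet<>(q);
        missing.removeAll(c);
        Set<String> extra = new HashSet<>(c);
        extra.removeAll(q);
        return missing.size() + extra.size();
    }

    /** Re-orders the hits: smallest mismatch first, then highest original score. */
    static List<Hit> rerank(String query, List<Hit> topHits) {
        List<Hit> sorted = new ArrayList<>(topHits);
        sorted.sort(Comparator
                .comparingInt((Hit h) -> mismatch(query, h.value))
                .thenComparing(Comparator.comparingDouble((Hit h) -> h.score).reversed()));
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("Abelia xgrandiflora 'Wevo1' BELLA DONNA", 13.7869625f));
        hits.add(new Hit("Abelia xgrandiflora", 13.74585f));
        for (Hit h : rerank("Abelia xgrandiflora", hits)) {
            System.out.println(h.value + " (score=" + h.score + ")");
        }
    }
}
```

For misspelled input, the exact token comparison in mismatch() would
have to be replaced by something more tolerant, for example an edit
distance check per token.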


- Matthias


[1]:
"Abelia xgrandiflora" -> "Abelia xgrandiflora 'Wevo1' BELLA DONNA"
(score=13.7869625)
instead of the direct match
"Abelia xgrandiflora" -> "Abelia xgrandiflora" (score=13.74585)

On Friday, 14.06.2019, at 16:32 +0900, Tomoko Uchida wrote:
> Hi Matthias,
> 
> What similarity class are you using?
> Just a guess... but one possible reason is document (field) length
> normalization. Generally speaking, shorter documents get higher
> scores than longer documents. (I have seen that the classic TF-IDF
> similarity tends to give much higher scores to shorter documents.
> Newer versions of Lucene use BM25 similarity by default, which
> moderates this tendency and has a tuning parameter 'b' to control
> the normalization effect.)
> See also: 
> https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html
> 
> As Namgyu Kim said, the explain() API could help you examine the
> details.
> 
> Tomoko
> 
> On Fri, 14 Jun 2019 at 1:27, Namgyu Kim <[email protected]> wrote:
> > Dear Matthias,
> > 
> > First, you need to know about Lucene's ranking concept.
> > Lucene's default ranking is BM25, and the scores depend on the
> > state of your index.
> > (https://en.wikipedia.org/wiki/Okapi_BM25)
> > There can be many reasons for this result.
> > One thing I can guess is that your index contains the term 'rozi'
> > many times, so it carries little weight.
> > This effect is called IDF (Inverse Document Frequency).
> > Anyway, if you want fine-grained control over the ranking, you
> > need to understand the BM25 formula.
> > 
> > Lucene can also tell you how a score was computed.
> > You can use Explanation to get it.
> > I attach some sample code.
> > ======================================
> > IndexSearcher searcher = new IndexSearcher(reader);
> > TopDocs docs = searcher.search(q, hitsPerPage);
> > ScoreDoc[] hits = docs.scoreDocs;
> > 
> > for (int i = 0; i < hits.length; ++i) {
> >   int docId = hits[i].doc;
> >   Explanation explanation = searcher.explain(q, docId);
> >   // You can see how the score is calculated
> >   System.out.println("Explanation : " + explanation.toString());
> > }
> > ======================================
> > 
> > I hope it helps :D
> > 
> > Best regards,
> > Namgyu Kim
> > 
> > P.S. For BM25, the default values in Lucene are k1 = 1.2, b = 0.75.
> > 
> > On Fri, 14 Jun 2019 at 12:54 AM, <[email protected]> wrote:
> > 
> > > I would suggest trying (indexing and searching) without the '
> > > characters first and seeing whether you can find it.
> > > 
> > > Thanks
> > > 
> > > 
> > > On 6/13/19 11:25 AM, Matthias Müller wrote:
> > > > I am currently matching botanic names (with possible
> > > > misspellings) against an indexed reference list with Lucene.
> > > > After quick progress in the beginning, I am struggling with
> > > > the proper query design to achieve the ranking result I want.
> > > > 
> > > > Here is an example:
> > > > 
> > > > Search term: Acer campestre 'Rozi'
> > > > 
> > > > Tokenized (decomposed) representation:
> > > > acer
> > > > campestre
> > > > rozi
> > > > 
> > > > Top 10 hits:
> > > > {value=Acer campestre, score=12.288989}
> > > > {value=Acer campestre 'Rozi', score=11.955223} // <- why is it 2nd?
> > > > {value=Acer campestre 'Arends', score=10.640412}
> > > > {value=Acer campestre subsp. leiocarpon, score=10.640412}
> > > > {value=Acer campestre 'Carnival', score=10.640412}
> > > > {value=Acer campestre 'Commodore', score=10.640412}
> > > > {value=Acer campestre 'Nanum', score=10.640412}
> > > > {value=Acer campestre 'Elsrijk', score=10.640412}
> > > > {value=Acer campestre 'Fastigiatum', score=10.640412}
> > > > {value=Acer campestre 'Geessink', score=10.640412}
> > > > 
> > > > 
> > > > And here is how I create my queries:
> > > > 
> > > > final BooleanQuery.Builder builder = new BooleanQuery.Builder();
> > > >    // add each token to the query as an optional (SHOULD) fuzzy clause
> > > >    for (String token : fuzzyTokens) {
> > > >      final Term term = new Term(NAME_TOKENS.name(), token);
> > > >      final FuzzyQuery fq = new FuzzyQuery(term);
> > > >      builder.add(fq, BooleanClause.Occur.SHOULD);
> > > >    }
> > > >    return builder.build();
> > > > }
> > > > 
> > > > 
> > > > Input names are analyzed with a StandardTokenizer and Lowercase
> > > > filter
> > > > when they are added to the IndexWriter.
> > > > 
> > > > 
> > > > My question: How can I get a ranking that scores
> > > > "Acer campestre 'Rozi'" higher than "Acer campestre"?
> > > > I am sure there is an obvious way to achieve this that I have
> > > > so far failed to find.
> > > > 
> > > > 
> > > > -Matthias
> > > > 
> > > > 
> > > > -------------------------------------------------------------
> > > > --------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: 
> > > > [email protected]
> > > > 
> > > 
> > > 
> > > 
> 
> 

