Re: Best fuzzy match on multiple terms

Tomoko Uchida Fri, 14 Jun 2019 22:36:06 -0700

Hi Boris,

Query parsing and scoring/ranking are completely separate processes,
so I'd debug those problems separately.
For debugging a fuzzy query, the Query.rewrite() method would be a good
first step (it lets you see all the expanded terms generated by the
fuzzy query).
I'm not sure exactly what your problem is, but in many cases you also
need to pay attention to the analyzers to get the desired or tuned
search results.
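
To illustrate what that expansion amounts to, here is a minimal
plain-Java sketch (no Lucene dependency). It assumes a tiny made-up
vocabulary and uses plain Levenshtein distance; the class and method
names are hypothetical. Lucene's real FuzzyQuery works against the
index's term dictionary and by default also counts a transposition as a
single edit, which this sketch does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FuzzyExpansion {

    // Classic Levenshtein distance (insert, delete, substitute) using two DP rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns every vocabulary term within maxEdits of the query term.
    static List<String> expand(String queryTerm, List<String> vocabulary, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : vocabulary) {
            if (editDistance(queryTerm, term) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary; a real index would supply its own terms.
        List<String> vocabulary = Arrays.asList("acer", "campestre", "rozi", "main", "nashua");
        System.out.println(expand("maink", vocabulary, 2));      // prints [main]
        System.out.println(expand("campestere", vocabulary, 2)); // prints [campestre]
    }
}
```

If the expanded term list is larger or stranger than expected, that is
usually a hint to look at the analyzer or the allowed edit distance
before looking at scoring.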


JFYI, Luke (a GUI tool for inspecting your Lucene
indexes/analyzers/search queries) is a convenient way to do that, if
you'd like.
https://github.com/DmitryKey/luke (This has been integrated into
Lucene since 8.1, but you can download older versions from the github
repo.)

e.g.
https://twitter.com/moco_beta/status/1139754595800928256
https://twitter.com/moco_beta/status/1139758109457391616

Enjoy.
Tomoko

On Sat, Jun 15, 2019 at 3:09, Matthias Müller <[email protected]> wrote:
>
> Hi Boris,
>
> "Acer campestre 'Rozi'" now receives a higher score with DFISimilarity
> and BM25Similarity (with tuned 'b') instead of the standard BM25.
>
> It really was a scoring/normalization issue: while "Rozi" got a
> higher score, "Acer" and "campestre" received lower values and the
> combined result was fractions of a score below the desired hit.
>
> -Matthias
>
>
>
> On Friday, 14.06.2019, at 10:41 -0400, [email protected] wrote:
> > These are great suggestions; I was going to suggest the explain plan
> > of the query, too.
> >
> > I really wonder why, in your case, the 'Rozi' entry does not get a
> > higher score.
> >
> > Is there any effect from the " ' " chars?
> >
> >
> > In my case I have sort of the reverse situation:
> >
> > my query is maink~2 (mains was a special case that I am still
> > investigating)
> >
> > I would expect the second result below to be the first result, as it
> > is shorter and the closest hit, and the first result to be the second.
> >
> > NASHUA in results: MAIN DUNSTABLE NASHUA HILLSBOROUGH NEW HAMPSHIRE
> > UNITED STATES in the 0 th result
> > NASHUA in results: MAIN NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED
> > STATES in the 1 th result
> >
> >
> > Best regards
> >
> >
> > On 6/14/19 6:45 AM, Matthias Müller wrote:
> > > Hi Namgyu and Tomoko,
> > >
> > > your hint towards Explanation was very helpful; I was not aware of
> > > this feature.
> > >
> > > I have now experimented with different scoring functions, and it
> > > seems that DFISimilarity and BM25Similarity (with lower 'b') produce
> > > results in the direction I prefer, though not perfect for some
> > > cases [1].
> > >
> > > The fuzzy term queries probably generate hard-to-predict similarity
> > > scores on additional fields. These add to the overall score and also
> > > affect normalization.
> > >
> > > On the positive side, the preferred matches are somewhere in the top
> > > ranks. So maybe a rule-based assessment of the top N hits might help
> > > me achieve what I want.
> > >
> > >
> > > - Matthias
> > >
> > >
> > > [1]:
> > > "Abelia xgrandiflora" -> "Abelia xgrandiflora 'Wevo1' BELLA DONNA"
> > > (score=13.7869625)
> > > instead of the direct match
> > > "Abelia xgrandiflora" -> "Abelia xgrandiflora" (score=13.74585)
> > >
> > > On Friday, 14.06.2019, at 16:32 +0900, Tomoko Uchida wrote:
> > > > Hi Matthias,
> > > >
> > > > What similarity class are you using?
> > > > Just a guess... but one possible reason is document (field) length
> > > > normalization. Generally speaking, shorter documents get higher
> > > > scores than longer documents. (I have seen that the classic TFIDF
> > > > similarity tends to give much higher scores to shorter documents.
> > > > Newer versions of Lucene use BM25 similarity by default, which
> > > > moderates that tendency and has a tuning parameter 'b' to control
> > > > the normalization effect.)
> > > > See also:
> > > > https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html
> > > >
> > > > As Namgyu Kim said, the explain() API could help you examine the
> > > > details.
> > > >
> > > > Tomoko
> > > >
> > > > On Fri, Jun 14, 2019 at 1:27, Namgyu Kim <[email protected]> wrote:
> > > > > Dear Matthias,
> > > > >
> > > > > First you need to know about Lucene's ranking concept.
> > > > > Lucene's default ranking is BM25, and it depends on the state of
> > > > > your index.
> > > > > (https://en.wikipedia.org/wiki/Okapi_BM25)
> > > > > There can be many reasons.
> > > > > One guess is that the term 'rozi' occurs in a lot of documents in
> > > > > your index, so it carries little weight.
> > > > > This effect is called IDF (Inverse Document Frequency).
> > > > > Anyway, if you want fine-grained control, you need to understand
> > > > > the BM25 formula.
> > > > >
> > > > > Lucene can also tell you how your score was computed.
> > > > > You can use Explanation to get it.
> > > > > I attach some sample code.
> > > > > ======================================
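> > > > > import org.apache.lucene.search.Explanation;
> > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > import org.apache.lucene.search.ScoreDoc;
> > > > > import org.apache.lucene.search.TopDocs;
> > > > >
> > > > > // 'reader', 'q' and 'hitsPerPage' are assumed to be defined by the caller
> > > > > IndexSearcher searcher = new IndexSearcher(reader);
> > > > > TopDocs docs = searcher.search(q, hitsPerPage);
> > > > > ScoreDoc[] hits = docs.scoreDocs;
> > > > >
> > > > > for (int i = 0; i < hits.length; ++i) {
> > > > >    int docId = hits[i].doc;
> > > > >    Explanation explanation = searcher.explain(q, docId);
> > > > >    // You can see how the score is calculated
> > > > >    System.out.println("Explanation : " + explanation.toString());
> > > > > }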
> > > > > ======================================
> > > > >
> > > > > I hope it helps :D
> > > > >
> > > > > Best regards,
> > > > > Namgyu Kim
> > > > >
> > > > > P.S. For BM25, the default values in Lucene are k1 = 1.2 and b = 0.75.
> > > > >
> > > > > On Fri, Jun 14, 2019 at 12:54 AM, <[email protected]> wrote:
> > > > >
> > > > > > I would suggest trying (indexing and searching) without the '
> > > > > > characters and seeing whether you can find it first.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > >
> > > > > > On 6/13/19 11:25 AM, Matthias Müller wrote:
> > > > > > > I am currently matching botanic names (with possible
> > > > > > > misspellings) against an indexed reference list with Lucene.
> > > > > > > After quick progress in the beginning, I am struggling with
> > > > > > > the proper query design to achieve the ranking result I want.
> > > > > > >
> > > > > > > Here is an example:
> > > > > > >
> > > > > > > Search term: Acer campestre 'Rozi'
> > > > > > >
> > > > > > > Tokenized (decomposed) representation:
> > > > > > > acer
> > > > > > > campestre
> > > > > > > rozi
> > > > > > >
> > > > > > > Top 10 hits:
> > > > > > > {value=Acer campestre, score=12.288989}
> > > > > > > {value=Acer campestre 'Rozi', score=11.955223} // <- why is it 2nd?
> > > > > > > {value=Acer campestre 'Arends', score=10.640412}
> > > > > > > {value=Acer campestre subsp. leiocarpon, score=10.640412}
> > > > > > > {value=Acer campestre 'Carnival', score=10.640412}
> > > > > > > {value=Acer campestre 'Commodore', score=10.640412}
> > > > > > > {value=Acer campestre 'Nanum', score=10.640412}
> > > > > > > {value=Acer campestre 'Elsrijk', score=10.640412}
> > > > > > > {value=Acer campestre 'Fastigiatum', score=10.640412}
> > > > > > > {value=Acer campestre 'Geessink', score=10.640412}]
> > > > > > >
> > > > > > >
> > > > > > > And here is how I create my queries:
> > > > > > >
> > > > > > > final BooleanQuery.Builder builder = new BooleanQuery.Builder();
> > > > > > >     // add individual tokens to query
> > > > > > >     for (String token : fuzzyTokens) {
> > > > > > >       final Term term = new Term(NAME_TOKENS.name(), token);
> > > > > > >       final FuzzyQuery fq = new FuzzyQuery(term);
> > > > > > >       builder.add(fq, BooleanClause.Occur.SHOULD);
> > > > > > >     }
> > > > > > >     return builder.build();
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > Input names are analyzed with a StandardTokenizer and Lowercase
> > > > > > > filter when they are added to the IndexWriter.
> > > > > > >
> > > > > > >
> > > > > > > My question: How can I get a ranking that scores
> > > > > > > "Acer campestre 'Rozi'" higher than "Acer campestre"?
> > > > > > > I am sure there is an obvious way to achieve this that I have
> > > > > > > yet failed to find.
> > > > > > >
> > > > > > >
> > > > > > > -Matthias
> > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail:
> > > > > > > [email protected]
> > > > > > > For additional commands, e-mail:
> > > > > > > [email protected]
> > > > > > >
> > > > > >
> > > > > >
> > > >
> > >
> > >
> >
> >
>
>
>
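
As a rough illustration of the length normalization and the 'b'
parameter discussed in the quoted thread, the sketch below evaluates
the textbook Okapi BM25 term score for a short and a long field. The
document counts, field lengths and average length are made-up values,
and the class and method names are hypothetical. Lucene's
BM25Similarity differs in details (for example, it stores field lengths
in a lossy encoding), so these numbers will not match explain() output.

```java
public class Bm25LengthNorm {

    // Smoothed inverse document frequency in the form commonly used with BM25.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Textbook Okapi BM25 contribution of one query term to one document.
    // b = 0 disables length normalization, b = 1 applies it fully.
    static double termScore(double idf, double tf, double docLen,
                            double avgDocLen, double k1, double b) {
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf * tf * (k1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double avgDocLen = 3.0;              // assumed average field length in tokens
        double termIdf = idf(1000, 100);     // assumed: 1000 docs, 100 contain the term

        for (double b : new double[] {0.75, 0.3, 0.0}) {
            // Same term, same frequency, only the field length differs (2 vs 3 tokens).
            double shortDoc = termScore(termIdf, 1, 2, avgDocLen, k1, b);
            double longDoc = termScore(termIdf, 1, 3, avgDocLen, k1, b);
            System.out.printf("b=%.2f short=%.4f long=%.4f gap=%.4f%n",
                    b, shortDoc, longDoc, shortDoc - longDoc);
        }
    }
}
```

The gap between the short and the long field shrinks as 'b' is lowered,
which is consistent with a lower 'b' letting the extra matching term in
the longer name decide the ranking.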

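
The idea of a rule-based pass over the top N hits, mentioned in the
quoted thread, might look like the following sketch. It assumes one
simple rule: hits whose token set equals the query's token set are
moved to the front, and otherwise the original order is kept. The Hit
class, the tokens() helper and the rule itself are hypothetical, and
the regex split only approximates StandardTokenizer plus lowercasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TopHitsReranker {

    // Minimal stand-in for a search hit: the stored name and its score.
    static final class Hit {
        final String value;
        final double score;

        Hit(String value, double score) {
            this.value = value;
            this.score = score;
        }
    }

    // Lowercases and splits on anything that is not a letter or digit.
    static Set<String> tokens(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Moves hits with exactly the query's token set to the front.
    // List.sort is stable, so all other hits keep their score order.
    static List<Hit> rerank(String query, List<Hit> hits) {
        Set<String> queryTokens = tokens(query);
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing((Hit h) -> !tokens(h.value).equals(queryTokens)));
        return sorted;
    }

    public static void main(String[] args) {
        // Names and scores taken from the first message in the thread.
        List<Hit> hits = Arrays.asList(
                new Hit("Acer campestre", 12.288989),
                new Hit("Acer campestre 'Rozi'", 11.955223),
                new Hit("Acer campestre 'Arends'", 10.640412));
        List<Hit> reranked = rerank("Acer campestre 'Rozi'", hits);
        System.out.println(reranked.get(0).value); // prints Acer campestre 'Rozi'
    }
}
```

This only helps when the query is spelled correctly; for misspelled
input the rule would need to compare tokens fuzzily as well.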
