Trey Jones created LUCENE-8419:
----------------------------------

             Summary: Return token unchanged for pathological Stempel tokens
                 Key: LUCENE-8419
                 URL: https://issues.apache.org/jira/browse/LUCENE-8419
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
            Reporter: Trey Jones
         Attachments: dotc.txt, dotdotc.txt, twoletter.txt

In the aggregate, Stempel does a good job, but certain tokens get stemmed 
pathologically, conflating completely unrelated words in the search index. 
Depending on the scoring function, documents returned may have no form of the 
word that was in the query, only unrelated forms (see ć examples below).

It's probably not possible to fix the stemmer, and it's probably not possible 
to catch _every_ error, but catching and ignoring certain large classes of 
errors would greatly improve precision, and doing it in the stemmer would 
prevent losses to recall that happen from cleaning up these errors outside the 
stemmer.

An obvious example is that numbers ending in 1 have the last two digits 
replaced with ć, so 12341 is stemmed as 123ć. Numbers ending in 31 have the 
last four digits removed and replaced with ć, so 12331 is stemmed as 1ć. Tokens 
mixing letters and digits are treated the same way: abc123451 is stemmed as 
abc1234ć, and abc1231 is stemmed as abcć.

*Proposed solution:* any token that ends in a digit should not be stemmed; it 
should be returned unchanged.
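
The guard proposed above can be sketched in a few lines. This is only an illustration in Python, not Stempel code: fake_stem and REPORTED_STEMS are hypothetical stand-ins that replay the stems reported in this issue, and stem_unless_trailing_digit is an invented name.

```python
# Hypothetical stand-in for the real stemmer: it only replays the
# input -> stem pairs reported in this issue.
REPORTED_STEMS = {
    "12341": "123ć",
    "12331": "1ć",
    "abc123451": "abc1234ć",
    "abc1231": "abcć",
}


def fake_stem(token):
    """Return the reported stem, or the token itself if none was reported."""
    return REPORTED_STEMS.get(token, token)


def stem_unless_trailing_digit(token, stem=fake_stem):
    """Skip stemming for any token whose last character is a digit."""
    # str.isdigit() also accepts non-ASCII digits; an ASCII-only check
    # would be a reasonable alternative.
    if token and token[-1].isdigit():
        return token
    return stem(token)


for t in ["12341", "12331", "abc1231"]:
    print(t, "->", fake_stem(t), "(unguarded) vs",
          stem_unless_trailing_digit(t), "(guarded)")
```

Because the check only looks at the input token, it could run before the stemmer is invoked at all.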

One-letter stems from the set [a-zńć] are generally useless and often absurd.

ć is the worst offender by far (it's the ending of the infinitive form of 
verbs). All of these tokens (found on Polish Wikipedia/Wiktionary) get stemmed 
to ć:
 * acque Adrien aguas Águas Alainem Alandh Amores Ansoe Arau asinaio aŭdas 
audyt Awiwie Ayres Baby badż Baina Bains Balue Baon baque Barbola Bazy Beau 
beim Beroe Betz Blaue blenda bleue Blizzard boor Boruca Boym Brodła Brogi 
Bronksie Brydż Budgie Budiafa bujny Buon Buot Button Caan Cains Canoe Canona 
caon Celu Charl Chloe ciag Cioma Cmdr Conseil Conso Cotton Cramp Creel Cuyk 
cyan czcią Czermny czto D.III Daws Daxue dazzle decy Defoe Dereń Detroit digue 
Dior Ditton Dojlido dosei douk DRaaS drag drau Dudacy dudas Dutton Duty Dziób 
eayd Edwy Edyp eiro Eltz Emain erar ESaaS faan Fetz figurar Fitz foam Frau 
Fugue GAAB gaan Gabirol Gaon gasue Gaup Geol GeoMIP Getz gigue Ginny Gioią Girl 
Goam Gołymin Gosei Götz grasso Grodnie Gula Guroo gyan HAAB Haan Heim Héroe 
Hitz Hoam Hohenho Hosei Huon Hutton Huub hyaina Iberii inkuby Inoue Issue ITaaS 
Iudas Izmaile Jaan Jaws jedyn Jews jira Josepho Jost Josue Judas Kaan Kaleido 
Karoo Katz Kazue Kehoe khayag kiwa Kiwu Klaas kmdr Kokei Konoe kozer kpią 
Kringle ksiezyce Któż Kutz L231 L331 Laan Lalli Laon Laws łebka Leroo Liban 
Ligue Liro Lisoli Logue Loja Londyn Lubomyr Luque Lutz Lytton łzawy Maan mains 
Mainy malpaco Mammal mandag MBaaS meeki Merl Metz MIDAS middag Miras mmol modą 
moins Monty Moryń motz mróż Mutz Müzesi MVaaS Naam nabrzeża Nadab Nadala 
Nalewki Nd:YAG neol News Nieszawa Nimue Nyam ÖAAB oblał oddala okala Olień opar 
oppi Orioł Osioł osoagi Osyki Otóż Output Oxalido pasmową Patton Pearl Peau 
peoplk Petz poar Pobrzeża poecie Pogue Pono posagi posł Praha Pringle probie 
progi Prońko Prosper prwdę Psioł Pułka Putz QDTOE Quien Qwest radża raga Rains 
reht Reich Retz Revue Right RITZ Roam Rogue Roque rosii RU31 Rutki Ryan SAAB 
saasso salue Sampaio Satz Sears Sekisho semo Setton Sgan Siloe Sitz Skopje Slot 
Šmarje Smrkci Soar sopo sozinho springa Steel Stip Straz Strip Suez sukuby 
Sumach Surgucie Sutton svasso Szosą szto Tadas Taira tęczy Teodorą teol Tisii 
Tisza Toluca Tomoe Toque TPMŻ Traiana Trask Traue Tulyag Tuque Turinga Undas 
Uniw usque Vague Value Venue Vidas Vogue Voor W331 Waringa weht Weich Weija 
Wheel widmem WKAG worku Wotton Wryk Wschowie wsiach wsiami Wybrzeża wydala 
Wyraz XLIII XVIII XXIII Yaski yeol YONO Yorki zakręcie Zijab zipo.

Four-character tokens ending in 31 (like 2,31 9,31 1031 1131 7431 8331 a331) 
also all get stemmed to ć.

Below are examples of other tokens (from Polish Wikipedia/Wiktionary) that get 
stemmed to one-letter tokens in [a-zńć]. Note that i, o, u, w, and z are stop 
words, and so don't show up in the list.
 * a: a, addo, adygea, jhwh, also
 * b: b, bdrm, barr, bebek, berr, bounty, bures, burr, berm, birm
 * c: alzira, c, carr, county, haight, hermas, kidoń, paich, pieter, połóż, 
radoń, soest, tatort, voight, zaba, biegną, pokaż, wskaż, zoisyt
 * d: award, d, dlek, deeb
 * e: e, eddy, eloi
 * f: f, farr, firm
 * g: g, geagea, grunty, gwdy, gyro, górą
 * h: h
 * i: inre, isro
 * j: j, judo
 * k: k, kgtj, kpzr, karr, kerr, ksok
 * l: l, leeb, loeb
 * m: m, magazyn, marr, mayor, merr, mnsi, murr, mgły, najmu
 * n: johnowi, n
 * o: obzr, offy
 * p: p, pace, paoli, parr, pasji, pawełek, pyro, pirsy, plmb
 * q: q
 * r: r, rite, rrek
 * s: s, sarr, site, sowie, szok
 * t: leźnie, t, tnsw, tooi
 * u: noite
 * w: wmro, warr, wifi, wyspom, wątki
 * x: x
 * y: jesteś, lafleur, nate, nowsze, violeur, y, yach, douleur
 * z: czok, skrawek
 * ń: cisew, esso

All other one-character stems I have encountered have been for one-character 
input tokens (especially those in other writing systems).

*Proposed solution:* if a token gets stemmed to a one-letter stem (either in 
general, or specifically if the letter is one of [a-zńć]), the input token 
should be returned unchanged.
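
A minimal sketch of this check, again in Python for illustration only. fake_stem replays a few of the pairs listed above, and the restrict flag (a made-up parameter) switches between the general variant and the [a-zńć] variant of the proposal.

```python
import re

# One-letter stems from the set named in the proposal.
ONE_LETTER = re.compile(r"[a-zńć]")

# Hypothetical stand-in for the real stemmer, replaying pairs from the
# lists above.
REPORTED_STEMS = {"Detroit": "ć", "bounty": "b", "wifi": "w", "jesteś": "y"}


def fake_stem(token):
    return REPORTED_STEMS.get(token, token)


def stem_unless_one_letter(token, stem=fake_stem, restrict=True):
    """Return the input token if its stem is a single letter.

    restrict=True  -> only one-letter stems in [a-zńć] are rejected
    restrict=False -> any one-character stem is rejected
    """
    result = stem(token)
    if restrict:
        rejected = ONE_LETTER.fullmatch(result) is not None
    else:
        rejected = len(result) == 1
    return token if rejected else result


for t in ["Detroit", "bounty", "wifi"]:
    print(t, "->", stem_unless_one_letter(t))
```

One-character input tokens are unaffected either way, since returning the input and returning the stem give the same result for them.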

There are other patterns of unreliable stems, though the ones above are the 
worst.

Two-letter stems are generally unreliable (see attachment twoletter.txt). The 
specific stems my, um, ąc, and ły are particularly random.

Two- and three-letter stems fitting the patterns .ć and ..ć are generally not 
useful (see attachments dotc.txt and dotdotc.txt for full lists of examples). 
The specific stems ać, eć, yć, ąć, ść, and źć are particularly random.

The specific stems ować, iwać, obić, snąć, ywać, ium also stand out as 
egregious:
 * ium: IIIC, Treze
 * iwać: Blefa, Crew, Iwano, Krall, Leseur, Maksiu, Stefa, Wrycz, cygar, horou
 * obić: Dawka, Obiło, dawka, obicia, obito
 * ować: Abdou, Bangu, Beess, Biblie, Birmie, Bohle, Bredy, Buddę, Czubą, 
Darją, Fatou, Firmie, Füssli, Ghany, Haeng, Katją, Koszyc, Ligę, Limie, Madou, 
Ozmy, Pitou, Riess, Sloane, Smółka, Soeng, TheFa, UWSS, firmie, ligę, szury, 
úzkost
 * snąć: Koziej, Schwab, Serial, Spain, serial
 * ywać: Ariza, odkuł, sorgo


*Proposed solution:* Return the input token if the stem meets one or more of 
the following criteria:
 # stem matches /^[a-zął][a-zćń]$/
 # stem matches /^.ć$/
 # stem is one of my, um, ąc, ły, ać, eć, yć, ąć, ść, or źć
 # stem matches /^..ć$/
 # stem is one of ować, iwać, obić, snąć, ywać, ium

Note: (1) covers most of (2) and (3); the exceptions are stems such as ść and 
źć, whose first letter is outside [a-zął]. (2) does not cover my, um, ąc, or 
ły in (3), so (2) and part of (3) could be combined.
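
The five criteria can be sketched as a single predicate. This is an illustration in Python with invented names. It assumes the .ć and ..ć patterns are meant to match the whole stem (two- and three-letter stems, as described above), so it uses full matches.

```python
import re

TWO_LETTER = re.compile(r"[a-zął][a-zćń]")  # criterion 1
DOT_C = re.compile(r".ć")                   # criterion 2, whole stem
DOT_DOT_C = re.compile(r"..ć")              # criterion 4, whole stem
RANDOM_TWO_LETTER = {                       # criterion 3
    "my", "um", "ąc", "ły", "ać", "eć", "yć", "ąć", "ść", "źć",
}
EGREGIOUS = {"ować", "iwać", "obić", "snąć", "ywać", "ium"}  # criterion 5


def is_unreliable_stem(stem):
    """True if the stem meets at least one of the five criteria."""
    return bool(
        TWO_LETTER.fullmatch(stem)
        or DOT_C.fullmatch(stem)
        or stem in RANDOM_TWO_LETTER
        or DOT_DOT_C.fullmatch(stem)
        or stem in EGREGIOUS
    )


def stem_or_original(token, stem):
    """Return the input token whenever the stem looks unreliable.

    `stem` is any callable mapping a token to its stem; it stands in
    for the real stemmer here.
    """
    result = stem(token)
    return token if is_unreliable_stem(result) else result


# "Spain" is reported above as being stemmed to "snąć".
print(stem_or_original("Spain", lambda t: "snąć"))
```

Keeping the explicit lists as sets makes it easy to add further stems as they are found.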

*General workaround:* Unpack Stempel into its constituent parts: recreate 
Stempel's stopword list as a stop filter (see LUCENE-8417), use polish_stem as 
the stemmer, use a pattern_replace filter to replace 
/^([a-zął]?[a-zćń]|..ć|\d.*ć)$/ with '', add a length filter to remove the 
resulting zero-length tokens, and add a stop filter with ować, iwać, obić, 
snąć, ywać, ium. Since many tokens are lost in this process, you also need an 
unstemmed index of the same text so you don't lose recall. (That's not exactly 
"easy", but it's what I've had to do.)
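
The post-stemming part of this workaround can be emulated outside the analysis chain to see which stems survive. The sketch below is Python and only mimics the replace, length, and stop steps on a list of already-stemmed tokens. It is not filter configuration, filter_stems is an invented name, and "kot" is a made-up example stem.

```python
import re

# The pattern from the workaround, anchored to the whole stem.
DROP_PATTERN = re.compile(r"^([a-zął]?[a-zćń]|..ć|\d.*ć)$")

# The additional stop list from the workaround.
STOP_STEMS = {"ować", "iwać", "obić", "snąć", "ywać", "ium"}


def filter_stems(stems):
    """Emulate replace-with-empty, drop-empty, then stop-list removal."""
    kept = []
    for stem in stems:
        stem = DROP_PATTERN.sub("", stem)  # pattern_replace step
        if not stem:                       # length step: drop empty tokens
            continue
        if stem in STOP_STEMS:             # stop step
            continue
        kept.append(stem)
    return kept


print(filter_stems(["ć", "123ć", "my", "być", "ować", "abcć", "kot"]))
```

Stems such as abcć (reported above for abc1231) pass through this pattern, because it only covers the digit case when the stem starts with a digit.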


