Trey Jones created LUCENE-8419:
----------------------------------
Summary: Return token unchanged for pathological Stempel tokens
Key: LUCENE-8419
URL: https://issues.apache.org/jira/browse/LUCENE-8419
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Reporter: Trey Jones
Attachments: dotc.txt, dotdotc.txt, twoletter.txt
In the aggregate, Stempel does a good job, but certain tokens get stemmed
pathologically, conflating completely unrelated words in the search index.
Depending on the scoring function, documents returned may have no form of the
word that was in the query, only unrelated forms (see ć examples below).
It's probably not possible to fix the stemmer, and it's probably not possible
to catch _every_ error, but catching and ignoring certain large classes of
errors would greatly improve precision. Handling them inside the stemmer would
also avoid the recall loss that comes from cleaning up these errors outside
the stemmer.
An obvious example is that numbers ending in 1 have the last two digits
replaced with ć, so 12341 is stemmed as 123ć. Numbers ending in 31 have the
last four digits replaced with ć, so 12331 is stemmed as 1ć. Tokens that mix
letters and digits are treated the same way: abc123451 is stemmed as abc1234ć,
and abc1231 is stemmed as abcć.
*Proposed solution:* any token that ends in a digit should not be stemmed; it
should just be returned unchanged.
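A minimal sketch of this guard, written in Python for brevity rather than as a
patch to the Java stemmer. stem_fn is a hypothetical stand-in for the Stempel
stemmer, the lookup table only replays stems quoted in this issue, and "ends in
a number" is assumed to mean "the last character is a decimal digit".

```python
def stem_unless_numeric(token, stem_fn):
    """Return the token unchanged if it ends in a decimal digit, else stem it."""
    if token and token[-1].isdecimal():
        return token
    return stem_fn(token)


# Hypothetical stand-in for Stempel: replays stems reported in this issue.
REPORTED_STEMS = {"12341": "123ć", "12331": "1ć", "abc1231": "abcć"}


def fake_stempel(token):
    """Look up a reported stem; fall back to the token itself."""
    return REPORTED_STEMS.get(token, token)


for tok in ("12341", "12331", "abc1231"):
    # Show the reported stem next to the guarded result.
    print(tok, fake_stempel(tok), stem_unless_numeric(tok, fake_stempel))
```

With the guard in place, each of the three tokens comes back unchanged instead
of as a stem ending in ć.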
One-letter stems from the set [a-zńć] are generally useless and often absurd.
ć is the worst offender by far (it's the ending of the infinitive form of
verbs). All of these tokens (found on Polish Wikipedia/Wiktionary) get stemmed
to ć:
* acque Adrien aguas Águas Alainem Alandh Amores Ansoe Arau asinaio aŭdas
audyt Awiwie Ayres Baby badż Baina Bains Balue Baon baque Barbola Bazy Beau
beim Beroe Betz Blaue blenda bleue Blizzard boor Boruca Boym Brodła Brogi
Bronksie Brydż Budgie Budiafa bujny Buon Buot Button Caan Cains Canoe Canona
caon Celu Charl Chloe ciag Cioma Cmdr Conseil Conso Cotton Cramp Creel Cuyk
cyan czcią Czermny czto D.III Daws Daxue dazzle decy Defoe Dereń Detroit digue
Dior Ditton Dojlido dosei douk DRaaS drag drau Dudacy dudas Dutton Duty Dziób
eayd Edwy Edyp eiro Eltz Emain erar ESaaS faan Fetz figurar Fitz foam Frau
Fugue GAAB gaan Gabirol Gaon gasue Gaup Geol GeoMIP Getz gigue Ginny Gioią Girl
Goam Gołymin Gosei Götz grasso Grodnie Gula Guroo gyan HAAB Haan Heim Héroe
Hitz Hoam Hohenho Hosei Huon Hutton Huub hyaina Iberii inkuby Inoue Issue ITaaS
Iudas Izmaile Jaan Jaws jedyn Jews jira Josepho Jost Josue Judas Kaan Kaleido
Karoo Katz Kazue Kehoe khayag kiwa Kiwu Klaas kmdr Kokei Konoe kozer kpią
Kringle ksiezyce Któż Kutz L231 L331 Laan Lalli Laon Laws łebka Leroo Liban
Ligue Liro Lisoli Logue Loja Londyn Lubomyr Luque Lutz Lytton łzawy Maan mains
Mainy malpaco Mammal mandag MBaaS meeki Merl Metz MIDAS middag Miras mmol modą
moins Monty Moryń motz mróż Mutz Müzesi MVaaS Naam nabrzeża Nadab Nadala
Nalewki Nd:YAG neol News Nieszawa Nimue Nyam ÖAAB oblał oddala okala Olień opar
oppi Orioł Osioł osoagi Osyki Otóż Output Oxalido pasmową Patton Pearl Peau
peoplk Petz poar Pobrzeża poecie Pogue Pono posagi posł Praha Pringle probie
progi Prońko Prosper prwdę Psioł Pułka Putz QDTOE Quien Qwest radża raga Rains
reht Reich Retz Revue Right RITZ Roam Rogue Roque rosii RU31 Rutki Ryan SAAB
saasso salue Sampaio Satz Sears Sekisho semo Setton Sgan Siloe Sitz Skopje Slot
Šmarje Smrkci Soar sopo sozinho springa Steel Stip Straz Strip Suez sukuby
Sumach Surgucie Sutton svasso Szosą szto Tadas Taira tęczy Teodorą teol Tisii
Tisza Toluca Tomoe Toque TPMŻ Traiana Trask Traue Tulyag Tuque Turinga Undas
Uniw usque Vague Value Venue Vidas Vogue Voor W331 Waringa weht Weich Weija
Wheel widmem WKAG worku Wotton Wryk Wschowie wsiach wsiami Wybrzeża wydala
Wyraz XLIII XVIII XXIII Yaski yeol YONO Yorki zakręcie Zijab zipo.
Four-character tokens ending in 31 (like 2,31 9,31 1031 1131 7431 8331 a331)
also all get stemmed to ć.
Below are examples of other tokens (from Polish Wikipedia/Wiktionary) that get
stemmed to one-letter tokens in [a-zńć]. Note that i, o, u, w, and z are stop
words, so those one-letter tokens themselves don't show up in their own rows.
* a: a, addo, adygea, jhwh, also
* b: b, bdrm, barr, bebek, berr, bounty, bures, burr, berm, birm
* c: alzira, c, carr, county, haight, hermas, kidoń, paich, pieter, połóż,
radoń, soest, tatort, voight, zaba, biegną, pokaż, wskaż, zoisyt
* d: award, d, dlek, deeb
* e: e, eddy, eloi
* f: f, farr, firm
* g: g, geagea, grunty, gwdy, gyro, górą
* h: h
* i: inre, isro
* j: j, judo
* k: k, kgtj, kpzr, karr, kerr, ksok
* l: l, leeb, loeb
* m: m, magazyn, marr, mayor, merr, mnsi, murr, mgły, najmu
* n: johnowi, n
* o: obzr, offy
* p: p, pace, paoli, parr, pasji, pawełek, pyro, pirsy, plmb
* q: q
* r: r, rite, rrek
* s: s, sarr, site, sowie, szok
* t: leźnie, t, tnsw, tooi
* u: noite
* w: wmro, warr, wifi, wyspom, wątki
* x: x
* y: jesteś, lafleur, nate, nowsze, violeur, y, yach, douleur
* z: czok, skrawek
* ń: cisew, esso
All other one-character stems I have encountered have been for one-character
input tokens (especially those in other writing systems).
*Proposed solution:* if a token gets stemmed to a one-letter stem (either in
general, or specifically if the letter is one of [a-zńć]), the input token
should be returned unchanged.
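A minimal sketch of this check in Python, assuming the stem has already been
computed elsewhere. The function name and the restrict_to_set flag are
hypothetical; the flag switches between the two variants mentioned above (any
one-character stem, or only a letter from [a-zńć]).

```python
import re

# Single-letter stems that this issue singles out as unreliable.
ONE_LETTER = re.compile(r"[a-zńć]")


def keep_if_one_letter_stem(token, stem, restrict_to_set=True):
    """Return the input token when its stem is a single letter, else the stem."""
    if restrict_to_set:
        bad = ONE_LETTER.fullmatch(stem) is not None
    else:
        bad = len(stem) == 1
    return token if bad else stem


# Token/stem pairs taken from the lists in this issue.
for token, stem in (("Detroit", "ć"), ("magazyn", "m")):
    print(token, stem, keep_if_one_letter_stem(token, stem))
```

One-character input tokens are unaffected either way, since returning the token
and returning its one-character stem give the same result.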
There are other patterns of unreliable stems, though the ones above are the
worst.
Two-letter stems are generally unreliable (see attachment twoletter.txt). The
specific stems my, um, ąc, and ły are particularly random.
Two- and three-letter stems fitting the patterns .ć and ..ć are generally not
useful (see attachments dotc.txt and dotdotc.txt for full lists of examples).
The specific stems ać, eć, yć, ąć, ść, and źć are particularly random.
The specific stems ować, iwać, obić, snąć, ywać, ium also stand out as
egregious:
* ium: IIIC, Treze
* iwać: Blefa, Crew, Iwano, Krall, Leseur, Maksiu, Stefa, Wrycz, cygar, horou
* obić: Dawka, Obiło, dawka, obicia, obito
* ować: Abdou, Bangu, Beess, Biblie, Birmie, Bohle, Bredy, Buddę, Czubą,
Darją, Fatou, Firmie, Füssli, Ghany, Haeng, Katją, Koszyc, Ligę, Limie, Madou,
Ozmy, Pitou, Riess, Sloane, Smółka, Soeng, TheFa, UWSS, firmie, ligę, szury,
úzkost
* snąć: Koziej, Schwab, Serial, Spain, serial
* ywać: Ariza, odkuł, sorgo
*Proposed solution:* Return the input token if the stem meets one or more of
the following criteria:
# stem matches /^[a-zął][a-zćń]$/
# stem matches /^.ć$/
# stem is one of my, um, ąc, ły, ać, eć, yć, ąć, ść, or źć
# stem matches /^..ć$/
# stem is one of ować, iwać, obić, snąć, ywać, ium
Note: (1) is a superset of (2) and (3). (2) does not cover my, um, ąc, or ły in
(3), so (2) and part of (3) could be combined.
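The five criteria could be combined into a single check along these lines. This
is a sketch in Python, not Stempel code: the function names are hypothetical,
the stem is assumed to be computed elsewhere, and criteria (2) and (4) are
assumed to be anchored at both ends, in line with the "two- and three-letter
stems" wording above.

```python
import re

# Criteria (1), (2) and (4); fullmatch() anchors each pattern at both ends.
PATTERNS = [
    re.compile(r"[a-zął][a-zćń]"),  # (1) two-letter stems
    re.compile(r".ć"),              # (2) any character followed by ć
    re.compile(r"..ć"),             # (4) any two characters followed by ć
]

# Criteria (3) and (5): explicitly listed stems.
LISTED = {
    "my", "um", "ąc", "ły", "ać", "eć", "yć", "ąć", "ść", "źć",
    "ować", "iwać", "obić", "snąć", "ywać", "ium",
}


def is_unreliable_stem(stem):
    """True if the stem meets at least one of the five criteria."""
    return stem in LISTED or any(p.fullmatch(stem) for p in PATTERNS)


def guarded_stem(token, stem):
    """Return the input token when its stem is flagged, else the stem."""
    return token if is_unreliable_stem(stem) else stem


# "Serial" -> "snąć" is one of the pairs reported in this issue.
print(guarded_stem("Serial", "snąć"))
```

Keeping the listed stems in a set and the patterns in a list makes it easy to
drop or add individual criteria while evaluating their effect on precision.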
*General workaround:* Unpack Stempel into its constituent parts: recreate
Stempel's stopword list as a stop filter (see LUCENE-8417), use polish_stem as
the stemmer, use a pattern_replace filter to replace
/^([a-zął]?[a-zćń]|..ć|\d.*ć)$/ with '', add a length filter to remove the
resulting zero-length tokens, and add a stop filter with ować, iwać, obić,
snąć, ywać, ium. Because this process drops many tokens, you also need an
unstemmed index of the same text so you don't lose recall. (That's not exactly
"easy", but it's what I've had to do.)
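To make the effect of the pattern concrete, here is a small Python simulation
of the pattern_replace, length, and stop steps applied to already-stemmed
tokens. It is only a sketch of the logic, not an analyzer configuration: the
function name is hypothetical, and "kot" is an arbitrary string standing in for
an acceptable stem.

```python
import re

# Pattern quoted in the workaround: one- and two-letter stems, ..ć stems,
# and stems that start with a digit and end in ć.
BAD_STEM = re.compile(r"^([a-zął]?[a-zćń]|..ć|\d.*ć)$")

# Stems removed by the final stop filter.
STOP_STEMS = {"ować", "iwać", "obić", "snąć", "ywać", "ium"}


def clean_stems(stems):
    """Mimic pattern_replace -> length -> stop on already-stemmed tokens."""
    replaced = (BAD_STEM.sub("", s) for s in stems)
    non_empty = (s for s in replaced if s)  # drop zero-length tokens
    return [s for s in non_empty if s not in STOP_STEMS]


print(clean_stems(["ć", "my", "123ć", "ować", "abcć", "kot"]))
```

In this simulation a stem such as abcć (from abc1231) passes through, because
it neither starts with a digit nor is three characters long.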
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)