[ https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795968#action_12795968 ]
Karl Wettin commented on LUCENE-1515:
-------------------------------------

I just posted this to the Snowball users list:

The Swedish Snowball stemmer does a terrible job according to <http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf>. The paper even claims that lfs5, i.e. substring(0,5), does a better job. (It also says that 5-grams crack the nut.)

This didn't come as a surprise to me, as I've identified problems in the past and implemented my own augmentation that has been posted to this list before, now living at <http://issues.apache.org/jira/browse/LUCENE-1515>. Reading the paper made me take a closer look at what's wrong.

define main_suffix as (
    setlimit tomark p1 for ([substring])
    among(
        'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
        'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
        'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
        'hetens' 'erns' 'at' 'andet' 'het' 'ast'
        'era' 'erar' 'erarna' 'erarnas'
        // augmentation starts here
        'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
        'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas'
        'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades'
        'ikation' 'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens'
        // augmentation ends here
            (delete)
        's'
            (s_ending delete)

In conjunction with ~200 exception rules these additions help. There are, however, quite a few problems with many of the old rules. E.g.

        's'
            (s_ending delete)

is a plural rule, but it has ~5300 exceptions: words that end in s in the nominative singular. The problem appears when such a word is written in any form other than the nominative singular.

    kurs (course)
    kursen (the course)
    kursens (the [undefined noun] of the course)
    kurser (courses)
    kurserna (the courses)
    kursernas (the [undefined noun] of the courses)

Kurs is stemmed to "kur" (which, by the way, will mismatch with kur as in remedy) while all the others are correctly stemmed as "kurs".

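To make the kurs/kur mismatch above concrete, here is a minimal Java sketch. It is not the Snowball-generated stemmer: toyStem is a hypothetical stand-in that strips only the suffixes from the kurs example, and ExceptionAwareStemmer is an illustrative name for a wrapper that looks a word up in a set of protected words before stemming it. Both names are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical wrapper: returns protected words unchanged and
// delegates every other word to an underlying stemmer function.
final class ExceptionAwareStemmer {
    private final Set<String> noStemExceptions;
    private final UnaryOperator<String> delegate;

    public ExceptionAwareStemmer(Set<String> noStemExceptions,
                                 UnaryOperator<String> delegate) {
        // Defensive copy so later changes to the caller's set have no effect.
        this.noStemExceptions = new HashSet<>(noStemExceptions);
        this.delegate = delegate;
    }

    public String stem(String word) {
        if (noStemExceptions.contains(word)) {
            return word; // protected word: skip stemming entirely
        }
        return delegate.apply(word);
    }

    // Toy stand-in for a real stemmer (assumption, NOT the Snowball rules):
    // removes the first matching suffix, longest first, provided at least
    // three characters remain.
    public static String toyStem(String word) {
        String[] suffixes = {"ernas", "erna", "ens", "en", "er", "s"};
        for (String suffix : suffixes) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }
}
```

With an empty exception set the toy stemmer turns "kurs" into "kur" while "kursen", "kurser" and "kursernas" all become "kurs". With "kurs" in the exception set, all of these forms end up as "kurs".
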
Altogether there are, according to my estimate, some 10 000 words that will produce incompatible stems between the nominative singular and any other form. That is about 8% of the official language.

One rather simple solution is to always use both the unstemmed and the stemmed words, e.g. as synonyms in an inverted index. But if only the stemmed output is used (from the official stemmer or my augmentation), I'd argue it's better to skip stemming altogether.

A better solution would be to set up the stemmer to ignore the 10 000 exceptions. What would be the best way to implement this? I'd like the generated Java code to simply contain a HashSet<String> noStemExceptions; that is checked first, or something like that.

> Improved(?) Swedish snowball stemmer
> ------------------------------------
>
>                 Key: LUCENE-1515
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1515
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.4
>            Reporter: Karl Wettin
>         Attachments: LUCENE-1515.txt
>
>
> Snowball stemmer for Swedish lacks support for '-an' and '-ans' related
> suffix stripping, ending up with incompatible stems for, for example, "klocka",
> "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix
> stripping rules:
> {pre}
> 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
> 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna'
> 'ansernas'
> 'iera'
> (delete)
> {pre}
> The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and
> this is an attempt at solving that problem. The rules and exceptions are
> based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista]
> entries suffixed with 'an' and 'ans'. There are a few known problematic stemming
> rules, but it seems to work quite a bit better than the current SwedishStemmer.
> It would not be a bad idea to check all of the SAOL entries in order to verify
> the integrity of the rules.
> My Snowball syntax skills are rather limited so I'm certain the code could be
> optimized quite a bit.
> *The code is released under BSD and not ASL*. I've been posting a bit in the
> Snowball forum and privately to Martin Porter himself but never got any
> response, so now I post it here instead in the hope of some momentum.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.