[ https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796068#action_12796068 ]
Robert Muir commented on LUCENE-1515:
-------------------------------------

bq. A better solution would be to set up the stemmer to ignore the 10 000 exceptions. What would be the best way to implement this? I'd like the generated Java code to simply contain a HashSet<String> noStemExceptions; that was checked first, or something like that.

Hi Karl, in my opinion the best way to handle this would be outside of Snowball itself.

This is really a problem beyond Swedish and even the Snowball stemmers: I think for many cases (other languages) people might have a list of 'protected words' they do not want the stemmer to mess with. Examples are common proper names, things like that.

This is currently a mess in my opinion:
* Solr has this functionality, but only for Snowball etc., because they do not actually use Lucene's SnowballFilter! Instead they have their own implementation (duplicate code).
* In some cases our non-Snowball stemmers support this: take a look at BrazilianStemmer, which has Set<?> stemExclusionSet. But this is inconsistent; most of our stemmers do not actually support it.

I would like to propose the following as a potential idea: just like Simon did with the inconsistent stopword handling, we could refactor some handling into all stemming TokenFilters (and probably add hooks into the analyzers, too) that simply allows the system to use a CharArraySet for 'stem ignore words', loaded via WordlistLoader from a text file, or however.

With this, you could create a text file with these 10,000 words as a default for Swedish. This would remove the 'protected words' duplication from Solr too in the future, and allow for protected words functionality across all stemmers.

> Improved(?) Swedish snowball stemmer
> ------------------------------------
>
>                 Key: LUCENE-1515
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1515
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.4
>            Reporter: Karl Wettin
>         Attachments: LUCENE-1515.txt
>
>
> Snowball stemmer for Swedish lacks support for '-an' and '-ans' related
> suffix stripping, ending up with incompatible stems for, for example, "klocka",
> "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix
> stripping rules:
> {pre}
> 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
> 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna'
> 'ansernas'
> 'iera'
> (delete)
> {pre}
> The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and
> this is an attempt at solving that problem. The rules and exceptions are
> based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista]
> entries suffixed with 'an' and 'ans'. There are a few known problematic stemming
> rules, but it seems to work quite a bit better than the current SwedishStemmer.
> It would not be a bad idea to check all of the SAOL entries in order to verify
> the integrity of the rules.
> My Snowball syntax skills are rather limited, so I'm certain the code could be
> optimized quite a bit.
> *The code is released under BSD and not ASL*. I've been posting a bit in the
> Snowball forum and privately to Martin Porter himself but never got any
> response, so now I post it here instead in hope of some momentum.
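
For illustration only, the 'stem ignore words' idea from the comment above (check a protected-word set first, and only stem terms that are not in it) might be sketched roughly as follows. This is not Lucene code: ProtectedWordStemmer and toyStem are hypothetical names, a plain java.util.Set stands in for CharArraySet, and the toy stemmer strips only a trailing "ans" or "an" rather than applying the rules attached to the issue.

```java
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical wrapper: consults a protected-word set before delegating to a stemmer. */
final class ProtectedWordStemmer implements UnaryOperator<String> {
    private final Set<String> protectedWords;      // assumed to hold lower-case entries
    private final UnaryOperator<String> delegate;  // any stemmer, e.g. a Snowball one

    ProtectedWordStemmer(Set<String> protectedWords, UnaryOperator<String> delegate) {
        this.protectedWords = Set.copyOf(protectedWords);
        this.delegate = delegate;
    }

    @Override
    public String apply(String term) {
        // Protected terms are returned unchanged; everything else is stemmed.
        String key = term.toLowerCase(Locale.ROOT);
        return protectedWords.contains(key) ? term : delegate.apply(term);
    }

    /** Toy stand-in for a real stemmer: strips only a trailing "ans" or "an". */
    static String toyStem(String term) {
        if (term.endsWith("ans")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.endsWith("an")) {
            return term.substring(0, term.length() - 2);
        }
        return term;
    }

    public static void main(String[] args) {
        // In the proposal this list would be loaded from a text file instead.
        Set<String> exceptions = Set.of("svans", "finans", "nyans");
        ProtectedWordStemmer stemmer =
                new ProtectedWordStemmer(exceptions, ProtectedWordStemmer::toyStem);
        for (String term : new String[] {"klockan", "klockans", "finans"}) {
            System.out.println(term + " -> " + stemmer.apply(term));
        }
    }
}
```

Keeping the check in a wrapper rather than in the generated stemmer means the same exception list could sit in front of any stemmer, which is the point of handling it outside of Snowball.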