Improved(?) Swedish snowball stemmer
------------------------------------

Key: LUCENE-1515
URL: https://issues.apache.org/jira/browse/LUCENE-1515
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 2.4
Reporter: Karl Wettin

The Snowball stemmer for Swedish lacks support for stripping '-an' and '-ans' related suffixes, so related word forms end up with incompatible stems, for example "klocka", "klockor", "klockornas", "klockAN", "klockANS".

Complete list of new suffix stripping rules:

{pre}
'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas'
'iera'
(delete)
{pre}

The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny), and this is an attempt at solving that problem. The rules and exceptions are based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] entries suffixed with 'an' and 'ans'.

There are a few known problematic stemming rules, but it seems to work quite a bit better than the current SwedishStemmer. It would not be a bad idea to check all of the SAOL entries in order to verify the integrity of the rules.

My Snowball syntax skills are rather limited, so I'm certain the code could be optimized quite a bit.

*The code is released under BSD and not ASL*.

I've been posting a bit in the Snowball forum and privately to Martin Porter himself but never got any response, so now I post it here instead in the hope of some momentum.
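
As a rough illustration of the idea only (this is not the Snowball code referred to above), the following Java sketch strips the longest matching suffix from the list unless the word is a known exception. The class name `AnSuffixSketch`, the `strip` method, the minimum stem length, and the three-word exception set are all assumptions made for this example. A real stemmer would apply the rules inside Snowball's R1 region and use the full SAOL-derived exception list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not part of Lucene or Snowball.
final class AnSuffixSketch {

    // Suffixes copied from the rule list in this issue.
    private static final String[] SUFFIXES = {
        "an", "anen", "anens", "anare", "aner", "anerna", "anernas",
        "ans", "ansen", "ansens", "anser", "ansera", "anserar",
        "anserna", "ansernas", "iera"
    };

    static {
        // Try longer suffixes first so "ansernas" wins over "an".
        Arrays.sort(SUFFIXES, (a, b) -> Integer.compare(b.length(), a.length()));
    }

    // Only the three exceptions named in the issue; the real list is longer.
    private static final Set<String> EXCEPTIONS =
        new HashSet<>(Arrays.asList("svans", "finans", "nyans"));

    // Assumption: never leave a stem shorter than two characters.
    private static final int MIN_STEM_LENGTH = 2;

    static String strip(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (EXCEPTIONS.contains(w)) {
            return w; // e.g. "svans" must not become "sv" or "svan"
        }
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            if (stemLength >= MIN_STEM_LENGTH && w.endsWith(suffix)) {
                return w.substring(0, stemLength);
            }
        }
        return w; // no rule applied
    }

    public static void main(String[] args) {
        for (String word : new String[] {"klockAN", "klockANS", "svans", "finans"}) {
            System.out.println(word + " -> " + strip(word));
        }
    }
}
```

With these assumptions, "klockAN" and "klockANS" both reduce to "klock", while "svans", "finans" and "nyans" are left untouched. Whether an exception should be kept whole or mapped to some other stem is a design decision this sketch does not settle.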