Improved(?) Swedish snowball stemmer
------------------------------------

                 Key: LUCENE-1515
                 URL: https://issues.apache.org/jira/browse/LUCENE-1515
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/*
    Affects Versions: 2.4
            Reporter: Karl Wettin


The Snowball stemmer for Swedish lacks suffix stripping for '-an', '-ans' and 
related suffixes, so it produces incompatible stems for words such as "klocka", 
"klockor", "klockornas", "klockAN" and "klockANS".  Complete list of new suffix 
stripping rules:

{pre}
            'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
            'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas'
            'iera'
                (delete)
{pre}
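
To illustrate what these rules do, here is a minimal Python sketch of 
longest-match suffix deletion. It is not the Snowball code attached to this 
issue. The names strip_new_suffix and MIN_STEM_LENGTH are hypothetical, and the 
length check is only an assumed stand-in for the region checks a real Snowball 
stemmer performs.

```python
# Suffixes copied from the rule list above; everything else is illustrative.
NEW_SUFFIXES = [
    "an", "anen", "anens", "anare", "aner", "anerna", "anernas",
    "ans", "ansen", "ansens", "anser", "ansera", "anserar", "anserna",
    "ansernas", "iera",
]

# Assumption: require at least this many characters to remain after deletion.
MIN_STEM_LENGTH = 3


def strip_new_suffix(word):
    """Delete the longest matching suffix from the list, if any."""
    word = word.lower()
    # Try longer suffixes first so "anernas" is preferred over shorter matches.
    for suffix in sorted(NEW_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for w in ("klockan", "klockans"):
    print(w, "->", strip_new_suffix(w))
```

With this sketch both "klockan" and "klockans" reduce to "klock".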

The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny), and 
this is an attempt at solving that problem. The rules and exceptions are based 
on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] entries 
suffixed with 'an' and 'ans'. There are a few known problematic stemming rules, 
but it seems to work quite a bit better than the current SwedishStemmer. It 
would not be a bad idea to check all SAOL entries in order to ensure the 
integrity of the rules.
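
The exception handling can be pictured roughly as follows. This is a 
simplified Python sketch, not the attached Snowball code: it knows only the 
three exceptions named above and only the bare 'an' and 'ans' suffixes, and the 
names EXCEPTIONS and stem_with_exceptions are hypothetical.

```python
# Only the three exceptions mentioned above; a fuller list would be
# derived from SAOL entries ending in 'an' and 'ans'.
EXCEPTIONS = {"svans", "finans", "nyans"}

# Longest suffix first, so "ans" is tried before "an".
SUFFIXES = ("ans", "an")


def stem_with_exceptions(word):
    """Strip 'ans'/'an' unless the word is a known exception."""
    word = word.lower()
    if word in EXCEPTIONS:
        # Leave the word alone so that e.g. "finans" does not become "fin".
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


for w in ("klockans", "finans", "svans", "nyans"):
    print(w, "->", stem_with_exceptions(w))
```

Inflected forms of the exceptions are not covered by this sketch. A real rule 
set would need to handle those as well.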

My Snowball syntax skills are rather limited so I'm certain the code could be 
optimized quite a bit.

*The code is released under BSD and not ASL*. I've been posting a bit in the 
Snowball forum and privately to Martin Porter himself but never got any 
response, so now I post it here instead in the hope of gaining some momentum.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


