[ 
https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796068#action_12796068
 ] 

Robert Muir commented on LUCENE-1515:
-------------------------------------

bq. A better solution would be to set up the stemmer to ignore the 10 000 
exceptions. What would be the best way to implement this? I'd like the 
generated Java code to simply contain a HashSet<String> noStemExceptions; that 
was checked first, or something like that.

Hi Karl, in my opinion the best way to handle this would be outside of Snowball 
itself. This is really a problem beyond Swedish and even the Snowball stemmers: 
for many other languages, people might have a list of 'protected words' they do 
not want the stemmer to touch, such as common proper names.
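
A minimal sketch of handling protected words outside the stemmer, in plain 
Java with no Lucene or Snowball dependency. The class ProtectedWordStemmer and 
the toyStem rule are hypothetical names invented for this illustration; toyStem 
only strips a trailing 'ans' or 'an' and is not the real Swedish rule set.

```java
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical wrapper: consults a protected-word set before delegating to any stemmer. */
final class ProtectedWordStemmer {
    private final Set<String> protectedWords;
    private final UnaryOperator<String> delegate;

    ProtectedWordStemmer(Set<String> protectedWords, UnaryOperator<String> delegate) {
        this.protectedWords = protectedWords;
        this.delegate = delegate;
    }

    /** Returns the term unchanged if it is protected, otherwise the delegate's stem. */
    String stem(String term) {
        return protectedWords.contains(term) ? term : delegate.apply(term);
    }

    /** Toy stand-in for a real stemmer (assumption): strips a trailing "ans" or "an". */
    static String toyStem(String term) {
        if (term.endsWith("ans")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.endsWith("an")) {
            return term.substring(0, term.length() - 2);
        }
        return term;
    }

    public static void main(String[] args) {
        ProtectedWordStemmer stemmer = new ProtectedWordStemmer(
                Set.of("svans", "finans", "nyans"), ProtectedWordStemmer::toyStem);
        System.out.println(stemmer.stem("svans"));    // protected, left untouched
        System.out.println(stemmer.stem("klockans")); // not protected, suffix stripped
    }
}
```

Because the check lives in the wrapper, the same protected-word list could sit 
in front of any stemmer, Snowball or otherwise, without touching generated code.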

This is currently a mess in my opinion:
* Solr has this functionality, but only for Snowball etc., because they do not 
actually use Lucene's SnowballFilter! Instead they have their own 
implementation (duplicate code).
* In some cases, our non-Snowball stemmers support this: take a look at 
BrazilianStemmer, which has Set<?> stemExclusionSet. But this is inconsistent; 
most of our stemmers do not actually support it.

I would like to propose the following as a potential idea:
Just like Simon did with the inconsistent stopword handling, we could refactor 
this handling into all stemming TokenFilters (and probably add hooks into the 
analyzers, too) so that the system can simply use a CharArraySet of 'stem 
ignore words', loaded via WordListLoader from a text file or by any other means.

With this, you could create a text file with these 10,000 words as a default 
for Swedish.
This would also remove the 'protected words' duplication from Solr in the 
future, and allow for protected-words functionality across all stemmers.
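
To make the text-file idea concrete, here is a rough sketch that reads such a 
word list into a set using only java.io and java.util. It stands in for the 
CharArraySet plus WordListLoader combination and does not use or mirror their 
actual APIs. The class name ProtectedWordList and the file format (one word per 
line, blank lines and lines starting with '#' skipped) are assumptions made for 
this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical loader for a 'stem ignore words' list; not part of Lucene. */
final class ProtectedWordList {
    private ProtectedWordList() {}

    /**
     * Reads one word per line. Surrounding whitespace is trimmed; blank lines
     * and lines starting with '#' are skipped (assumed format).
     */
    static Set<String> load(Reader source) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (word.isEmpty() || word.startsWith("#")) {
                    continue;
                }
                words.add(word);
            }
        }
        return words;
    }
}
```

A caller could pass a reader opened on the file, for example 
Files.newBufferedReader(path), and hand the resulting set to whichever stemming 
filter consults it.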

> Improved(?) Swedish snowball stemmer
> ------------------------------------
>
>                 Key: LUCENE-1515
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1515
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.4
>            Reporter: Karl Wettin
>         Attachments: LUCENE-1515.txt
>
>
> The Snowball stemmer for Swedish lacks support for '-an' and '-ans' related 
> suffix stripping, producing incompatible stems for words such as "klocka", 
> "klockor", "klockornas", "klockAN", "klockANS".  Complete list of new suffix 
> stripping rules:
> {pre}
>             'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
>             'ans' 'ansen' 'ansens' 'anser' 'ansera'  'anserar' 'anserna' 
> 'ansernas'
>             'iera'
>                 (delete)
> {pre}
> The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and 
> this is an attempt at solving that problem. The rules and exceptions are 
> based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] 
> entries suffixed with 'an' and 'ans'. There are a few known problematic 
> stemming rules, but it seems to work quite a bit better than the current 
> SwedishStemmer. It would not be a bad idea to check all of the SAOL entries 
> in order to verify the integrity of the rules.
> My Snowball syntax skills are rather limited so I'm certain the code could be 
> optimized quite a bit.
> *The code is released under BSD and not ASL*. I've been posting a bit in the 
> Snowball forum and privately to Martin Porter himself but never got any 
> response, so now I am posting it here instead in the hope of gaining some 
> momentum.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


