[ 
https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194874#comment-13194874
 ] 

Robert Muir commented on SOLR-2764:
-----------------------------------

Looks nice to me actually. I can't tell from the test data what you are 
already using (it's a binary file), but here are a few suggestions for 
testing (this is the process I used before):

* The existing actual+expected test data for the light stemmers was generated 
by running the original C implementations against the snowball vocabulary 
sets: I took the vocabulary files from snowball (the voc.txt in 
TestSnowballVocabData.zip), ran the original implementations over them, 
and recorded the output as the expected results. This is just a broad check 
that our implementation does the same thing as the original C one.
I'm not sure how good a vocabulary set that is for Norwegian, though.

* In this case, you don't actually have an existing evaluated implementation 
to conform to, so this test is less useful, except to check for 
PorterStemmer-type JRE crashes and to ensure that future refactorings don't 
change the algorithm (which would break index back compat).
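
A rough sketch of that kind of vocabulary check, in plain Java with no Lucene 
dependencies. The class name VocabularyCheck, its method names, and the toy 
suffix rule are all made up for illustration; a real test would read voc.txt 
and the expected output from the zip and call the actual stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper: compares a stemmer's output against recorded output.
final class VocabularyCheck {

    // Returns one description per word whose stem differs from the expected
    // stem. An empty list means the stemmer reproduces the recorded output.
    static List<String> mismatches(List<String> vocabulary,
                                   List<String> expected,
                                   UnaryOperator<String> stemmer) {
        if (vocabulary.size() != expected.size()) {
            throw new IllegalArgumentException("vocabulary and expected differ in length");
        }
        List<String> failures = new ArrayList<>();
        for (int i = 0; i < vocabulary.size(); i++) {
            String word = vocabulary.get(i);
            String actual = stemmer.apply(word);
            if (!actual.equals(expected.get(i))) {
                failures.add(word + " -> " + actual
                        + " (expected " + expected.get(i) + ")");
            }
        }
        return failures;
    }

    // Placeholder stemmer, NOT the Norwegian algorithm from the patch:
    // strips one trailing "s" from words longer than three characters.
    static String toyStem(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1)
                : word;
    }
}
```

Once the expected output has been recorded, the same check doubles as a 
regression test: any refactoring that changes a stem shows up as a mismatch.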

Personally, at a glance this looks pretty conservative and nice, but since 
there is no published algorithm to refer to, it might be nice to add some 
notes to the stemmer's Java file describing the high-level approach, and 
also some individual tests that are just examples showing what it does.

Take a look at Latvian (lv) for an example. In that case the algorithm is not 
exactly what was published in the referenced PhD thesis: I did actually 
implement the original algorithm, but my tests found it to be extremely 
aggressive... so it's similar to your case, I think.

                
> Create a NorwegianLightStemmer
> ------------------------------
>
>                 Key: SOLR-2764
>                 URL: https://issues.apache.org/jira/browse/SOLR-2764
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Jan Høydahl
>         Attachments: SOLR-2764.patch
>
>
> We need a light-weight stemmer for plural/singular only in Norwegian

