[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233472#comment-13233472 ]

Robert Muir commented on SOLR-2764:
---
Very nice work Jan!

Create a NorwegianLightStemmer and NorwegianMinimalStemmer
--
Key: SOLR-2764
URL: https://issues.apache.org/jira/browse/SOLR-2764
Project: Solr
Issue Type: New Feature
Components: Schema and Analysis
Reporter: Jan Høydahl
Assignee: Jan Høydahl
Fix For: 3.6, 4.0
Attachments: SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch

We need a simple light-weight stemmer and a minimal stemmer for plural/singular only in Norwegian

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228345#comment-13228345 ]

Jan Høydahl commented on SOLR-2764:
---
I will try to prepare a new, one-pass patch for this when time allows.
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199227#comment-13199227 ]

Jan Høydahl commented on SOLR-2764:
---
When looking at words ending in -het and -dom in dictionaries (such as Ooo nb_NO.dic), the base word has the same meaning in the vast majority of cases. But of course there will be exceptions. Take the word brennhet (het as in hot): it will be stemmed to brenn -> bren, which is kind of wrong, but since bren is not a valid word it won't cause errors.

There may be cases where the final stem clashes with another word, but no more than with the base rules. For example, there is a Norwegian surname Brenna which will be stemmed to brenn by the -a rule, which takes the -a for a feminine definite ending, and then we get a clash with the verb brenn (burn). And the first names Tore (boy) and Tora (girl) will be stemmed to Tor (boy), which is another valid first name...

My hunch is that the -dom/-het rules do more good than harm, mainly because in the majority of cases they lead to the base word and the -het/-dom word getting the same stem, even in cases where the -en/-et/-a/-e/-n rules are applied wrongly. Example:

{noformat}
One pass                        Two passes
forlegen        forleg          forlegen        forleg
forlegenhet     forlegen        forlegenhet     forleg
forlegenheten   forlegen        forlegenheten   forleg
forlegenhetens  forlegen        forlegenhetens  forleg
firkantet       firkant         firkantet       firkant
firkantethet    firkantet       firkantethet    firkant
firkantetheten  firkantet       firkantetheten  firkant
{noformat}

But I think maybe the rules -dommer and -dommen should be removed, because dommer (judge) and dommen (the sentence) are both common words that are valid as word endings. So the word linjedommer (linesman) would be stemmed to linje (line), which is too aggressive.

I see that it soon gets complicated to try to be clever. Should we go back to one-pass again for the light stemmer? Christian?
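The difference between the two columns in the example can be reproduced with a small sketch. This is not the code from the attached patch: the suffix lists are a simplified, hypothetical subset chosen only to match the example table, and the names `strip_first`, `one_pass`, `two_pass` and the `MIN_STEM` limit are assumptions made for illustration.

```python
# Hypothetical, simplified suffix lists (NOT the rules from SOLR-2764.patch).
HET_SUFFIXES = ("hetens", "heten", "het")  # -het plus its inflected forms
INFLECTIONAL = ("en", "et", "a", "e")      # a few base inflectional endings
MIN_STEM = 3                               # assumed lower bound on stem length


def strip_first(word, suffixes):
    """Strip the first suffix in `suffixes` that matches, if the stem stays long enough."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def one_pass(word):
    # A single scan over all endings, longest first: at most one suffix is removed.
    return strip_first(word, HET_SUFFIXES + INFLECTIONAL)


def two_pass(word):
    # Pass 1 removes the -het family; pass 2 removes an inflectional ending
    # from whatever pass 1 left behind.
    return strip_first(strip_first(word, HET_SUFFIXES), INFLECTIONAL)


if __name__ == "__main__":
    for w in ("forlegen", "forlegenhet", "forlegenhetens", "firkantet", "firkantethet"):
        print(f"{w:16}{one_pass(w):12}{two_pass(w)}")
```

Under these assumed rules, the two-pass variant sends forlegen and forlegenhet to the same stem, while the one-pass variant keeps them apart.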
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199235#comment-13199235 ]

Robert Muir commented on SOLR-2764:
---
Jan, I wasn't trying to be critical about these endings, because of course some of the existing light stemmers have a few _selected_ derivational endings that are taken care of. And that's really what it's all about when we are talking about something like adjective-noun. I didn't mean to say we shouldn't do it, because it sounds quite reasonable, but we should explore the options.

For example, as an alternative to multi-pass, a 'less elegant to some' but really practical way to go about it can be to 'multiply through' and convert the possibilities to single-pass. E.g. the typical 'undrinkables' hunspell example: if I have the English inflectional plural ending -s and the derivational ending -able, instead of:
* pass 1: remove inflectional endings (e.g. -s)
* pass 2: remove derivational endings (e.g. -able)

we just take all the pass 2 endings that are compatible with pass 1 endings and cross-multiply, to make a single-pass algorithm. Some won't be compatible (so we won't combine -able + -s into -ables).

I'm not sure if this is helpful for the Norwegian case as I'm not as familiar with it, just an idea.
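The 'multiply through' idea from the comment above could be sketched roughly as follows. The function names, the `compatible` predicate, the minimum stem length, and the sample ending lists are all assumptions made for illustration; they are not taken from the patch or from any existing Lucene stemmer.

```python
from itertools import product


def cross_multiply(derivational, inflectional, compatible):
    """Build one suffix table from two passes' worth of endings.

    Every ending is kept on its own, and each derivational ending is joined
    with each inflectional ending that `compatible` accepts. The result is
    ordered longest first so a single scan removes the longest match.
    """
    suffixes = set(derivational) | set(inflectional)
    for d, i in product(derivational, inflectional):
        if compatible(d, i):
            suffixes.add(d + i)
    return sorted(suffixes, key=lambda s: (-len(s), s))


def strip_once(word, suffixes, min_stem=3):
    """Single pass: remove at most one suffix, the first (longest) that matches."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Assumed sample data: -het combined with -en/-ens, as in the forms
# forlegenheten / forlegenhetens quoted earlier in the thread.
norwegian = cross_multiply(["het"], ["en", "ens"], lambda d, i: True)

# The comment's English example, with -able + -s treated as incompatible.
english = cross_multiply(["able"], ["s"], lambda d, i: False)

if __name__ == "__main__":
    print(norwegian)
    print(strip_once("forlegenhetens", norwegian))
    print(english)
```

The compatibility check is the part that needs linguistic judgment; the table construction and the single scan are mechanical once that is decided.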
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195863#comment-13195863 ]

Robert Muir commented on SOLR-2764:
---
Just some general suggestions: in a light stemmer, I would be wary of derivational endings. It seems that dom/het, because it deals with adj/noun, is on the edge (maybe OK here), but if possible it would be more ideal to avoid multiple passes... this is the kind of thing that causes snowball problems.

Can you think of examples for dom/het where the meaning would be changed? For example, freedom is used the same way in English, but stemming it to free is very lossy, since free has a variety of meanings (such as 'costs nothing'), some of which are incompatible with freedom. This is the danger of stripping derivational suffixes...
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195451#comment-13195451 ]

Christian Moen commented on SOLR-2764:
--
I added a few entries to the tests, including some irregular ones, to validate and illustrate how the stemmer works in these cases.

Jan, looks good to me. +1