[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233472#comment-13233472 ]

Robert Muir commented on SOLR-2764:
---

Very nice work Jan!

> Create a NorwegianLightStemmer and NorwegianMinimalStemmer
> --
>
> Key: SOLR-2764
> URL: https://issues.apache.org/jira/browse/SOLR-2764
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Fix For: 3.6, 4.0
>
> Attachments: SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch
>
>
> We need a simple light-weight stemmer and a minimal stemmer for plural/singular only in Norwegian

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228345#comment-13228345 ]

Jan Høydahl commented on SOLR-2764:
---

I will try to prepare a new patch for this when time allows, using a single pass.
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199235#comment-13199235 ]

Robert Muir commented on SOLR-2764:
---

Jan, I wasn't trying to be critical about these endings, because of course some of the existing light stemmers take care of a few _selected_ derivational endings. And that's really what it's all about when we are talking about something like adjective->noun. I didn't mean to say we shouldn't do it, because it sounds quite reasonable, but we should explore the options.

For example, as an alternative to multi-pass, a 'less elegant to some' but really practical way to go about it is to 'multiply through' and convert the possibilities to single-pass. E.g. the typical 'undrinkables' hunspell example: if I have the English inflectional plural ending -s and the derivational ending -able, instead of:
* pass 1: remove inflectional endings (e.g. -s)
* pass 2: remove derivational endings (e.g. -able)

we just take all the pass 2 endings that are compatible with pass 1 endings and cross-multiply, to make a single-pass algorithm. Some won't be compatible (so we won't combine -able + -s into -ables).

I'm not sure if this is helpful for the Norwegian case, as I'm not as familiar with it. Just an idea.
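The 'multiply through' idea above can be sketched roughly as follows. This is only an illustration, not code from any SOLR-2764 patch: the class name `CrossMultipliedSuffixes`, the methods `combine` and `stripOnce`, the minimum stem length, and the `compatible` predicate are all hypothetical, and the Norwegian endings used in the usage note (-het, -en, -ens) are borrowed from examples elsewhere in this thread purely for demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper: folds a two-pass suffix scheme into one suffix list.
final class CrossMultipliedSuffixes {

  // Returns every inflectional ending, every derivational ending, and each
  // compatible derivational+inflectional concatenation, longest first.
  static List<String> combine(List<String> derivational,
                              List<String> inflectional,
                              BiPredicate<String, String> compatible) {
    List<String> out = new ArrayList<>(inflectional);
    for (String d : derivational) {
      out.add(d);
      for (String i : inflectional) {
        if (compatible.test(d, i)) {
          out.add(d + i); // e.g. "het" + "en" -> "heten"
        }
      }
    }
    // Longest first, so a combined ending wins over its shorter tail.
    out.sort((a, b) -> Integer.compare(b.length(), a.length()));
    return out;
  }

  // Single pass: removes the first (longest) matching ending, then stops.
  // minStem is an assumed guard against stripping very short words.
  static String stripOnce(String word, List<String> suffixes, int minStem) {
    for (String s : suffixes) {
      if (word.endsWith(s) && word.length() - s.length() >= minStem) {
        return word.substring(0, word.length() - s.length());
      }
    }
    return word;
  }
}
```

For instance, combining the derivational ending "het" with the inflectional endings "en" and "ens" (treating every pair as compatible) gives the list hetens, heten, ens, het, en, and `stripOnce("forlegenheten", list, 3)` then removes "heten" in a single step. Which pairs are really compatible is a linguistic decision that this sketch leaves to the caller.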
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199227#comment-13199227 ]

Jan Høydahl commented on SOLR-2764:
---

When looking at words ending in -het and -dom in dictionaries (such as Ooo nb_NO.dic), the base word has the same meaning in the vast majority of cases. But of course there will be exceptions. Take the word "brennhet" (het as in hot): it will be stemmed to "brenn" -> "bren", which is kind of wrong, but then "bren" is not a valid word so it won't cause errors.

There may be cases where the final stem clashes with another word, but not more often than with the base rules. For example, there is a Norwegian surname "Brenna" which will be stemmed to "brenn" by the "-a" rule, which takes it for a feminine definite ending, and then we get a clash with the verb "brenn" (burn). And the first name "Tore" (boy) or "Tora" (girl) will be stemmed to "Tor" (boy), which is another valid first name...

My hunch is that the -dom/-het rules do more good than harm, mainly because in the majority of cases they lead to the base word and the -het/-dom word being stemmed to the same stem, even in cases where the "-en/-et/-a/-e/-n" rules are applied wrongly. Example:

{noformat}
One pass                        Two passes
forlegen        forleg          forlegen        forleg
forlegenhet     forlegen        forlegenhet     forleg
forlegenheten   forlegen        forlegenheten   forleg
forlegenhetens  forlegen        forlegenhetens  forleg
firkantet       firkant         firkantet       firkant
firkantethet    firkantet       firkantethet    firkant
firkantetheten  firkantet       firkantetheten  firkant
{noformat}

But I think maybe the rules -dommer and -dommen should be removed, because the words "dommer" (judge) and "dommen" (the sentence) are both common words that are valid as word endings. So the word "linjedommer" (linesman) would be stemmed to "linje" (line), which is too aggressive.

I see that it soon gets complicated when trying to be clever. Should we go back to one-pass again for the light stemmer? Christian?
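The difference between the two columns in the table above can be reproduced with a small sketch. It is not the stemmer from the patch: the class name `OnePassVsTwoPass`, the ending lists, their ordering, and the minimum stem length are assumptions chosen only so that the sample words from this comment behave as described.

```java
// Illustrative only: compares stopping after one suffix removal with
// running a second, inflectional pass afterwards.
final class OnePassVsTwoPass {

  // Assumed ending lists, taken from the examples in this comment.
  // Longer endings come first so that "heten" is tried before "het".
  static final String[] DERIVATIONAL = {"hetens", "heten", "het", "dom"};
  static final String[] INFLECTIONAL = {"en", "et", "a", "e", "n"};

  // Hypothetical guard so that very short words are left alone.
  static final int MIN_STEM = 3;

  // Removes the first matching ending from the list, if the stem stays long enough.
  static String strip(String word, String[] endings) {
    for (String e : endings) {
      if (word.endsWith(e) && word.length() - e.length() >= MIN_STEM) {
        return word.substring(0, word.length() - e.length());
      }
    }
    return word;
  }

  // One pass: remove a -het/-dom style ending if present, otherwise an
  // inflectional ending, and stop after that single removal.
  static String onePass(String word) {
    String stripped = strip(word, DERIVATIONAL);
    return stripped.equals(word) ? strip(word, INFLECTIONAL) : stripped;
  }

  // Two passes: remove a -het/-dom style ending first, then always try
  // the inflectional endings on whatever is left.
  static String twoPass(String word) {
    return strip(strip(word, DERIVATIONAL), INFLECTIONAL);
  }
}
```

Under these assumptions `onePass("forlegenhet")` stops at "forlegen", while `twoPass("forlegenhet")` continues to "forleg", the same stem that "forlegen" itself receives. The "brennhet" case also shows up: `twoPass("brennhet")` yields "bren".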
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195863#comment-13195863 ]

Robert Muir commented on SOLR-2764:
---

Just some general suggestions: in a light stemmer, I would be wary of derivational endings. In the case of dom/het it seems to be on the edge (maybe OK here) because it's dealing with adjective/noun, but if possible it would be better to avoid multiple passes... this is the kind of thing that causes snowball problems.

Can you think of examples for dom/het where the meaning would be changed? For example, "freedom" is used the same way in English, but stemming this to "free" is very lossy, since "free" has a variety of meanings (such as "costs nothing"), some of which are incompatible with "freedom". This is the danger of stripping derivational suffixes...
[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195451#comment-13195451 ]

Christian Moen commented on SOLR-2764:
---

I added a few entries to the tests, including some irregular ones, to validate and illustrate how the stemmer works in these cases.

Jan, looks good to me. +1