[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer

2012-03-20 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233472#comment-13233472
 ] 

Robert Muir commented on SOLR-2764:
---

Very nice work Jan!

 Create a NorwegianLightStemmer and NorwegianMinimalStemmer
 --

 Key: SOLR-2764
 URL: https://issues.apache.org/jira/browse/SOLR-2764
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Jan Høydahl
Assignee: Jan Høydahl
 Fix For: 3.6, 4.0

 Attachments: SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch, 
 SOLR-2764.patch, SOLR-2764.patch


 We need a simple light-weight stemmer and a minimal stemmer for 
 plural/singular only in Norwegian

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer

2012-03-13 Thread Jan Høydahl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228345#comment-13228345
 ] 

Jan Høydahl commented on SOLR-2764:
---

I will try to prepare a new patch for this, using the one-pass approach, when 
time allows.


[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer

2012-02-02 Thread Jan Høydahl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13199227#comment-13199227
 ] 

Jan Høydahl commented on SOLR-2764:
---

When looking at words ending in -het and -dom in dictionaries (such as the OOo 
nb_NO.dic), the base word has the same meaning in the vast majority of cases. 
But of course there will be exceptions. Take the word brennhet (het as in 
hot): it will be stemmed to brenn -> bren, which is kind of wrong, but since 
bren is not a valid word it won't cause errors. There may be cases where the 
final stem clashes with another word, but no more often than with the base 
rules. E.g. there is a Norwegian surname Brenna which will be stemmed to 
brenn by the -a rule, which takes it for a fem. definite ending, and then we 
get a clash with the verb brenn (burn). And the first names Tore (boy) and Tora 
(girl) will be stemmed to Tor (boy), which is another valid first name...
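
One way to look for clashes of this kind is to stem a word list and group the words by resulting stem. The following is only a minimal Python sketch under stated assumptions: `toy_stem` and `conflations` are hypothetical helpers, the two endings (-a, -e) are a tiny subset picked for illustration and not the rule set of the patch, and the minimum stem length of 3 is an arbitrary choice.

```python
from collections import defaultdict

# Hypothetical subset of the base rules: strip a final -a or -e.
ENDINGS = ("a", "e")


def toy_stem(word, min_stem=3):
    """Strip one of the illustrative endings if enough of the word remains."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)]
    return word


def conflations(words, stem=toy_stem):
    """Group lowercased words by stem and keep only stems shared by several words."""
    groups = defaultdict(set)
    for word in words:
        lowered = word.lower()
        groups[stem(lowered)].add(lowered)
    return {s: sorted(ws) for s, ws in groups.items() if len(ws) > 1}


clashes = conflations(["Brenna", "brenn", "Tore", "Tora", "Tor"])
for stem, members in sorted(clashes.items()):
    print(stem, "<-", ", ".join(members))
```

Run against a real dictionary and the real rules, a report like this would show how often unrelated words end up on the same stem.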

My hunch is that the -dom/-het rules do more good than harm, mainly because 
in the majority of cases they lead to the base word and the -het/-dom word being 
stemmed to the same stem in cases where the -en/-et/-a/-e/-n rules are applied 
wrongly. Example:

{noformat}
One pass                     Two passes
forlegen        forleg       forlegen        forleg
forlegenhet     forlegen     forlegenhet     forleg
forlegenheten   forlegen     forlegenheten   forleg
forlegenhetens  forlegen     forlegenhetens  forleg
firkantet       firkant      firkantet       firkant
firkantethet    firkantet    firkantethet    firkant
firkantetheten  firkantet    firkantetheten  firkant
{noformat}
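
The difference between the two columns can be sketched in a few lines of Python. This is an illustration only, not the code from the patch: the ending lists are a small assumed subset, the minimum stem length of 3 is arbitrary, and `strip_one`, `one_pass` and `two_pass` are hypothetical names.

```python
# Illustrative subset of endings, longest first; NOT the rule set from the patch.
DERIVATIONAL = ("hetens", "heten", "het")
INFLECTIONAL = ("en", "et", "a", "e", "n")
MIN_STEM = 3  # assumed minimum stem length, chosen for this sketch


def strip_one(word, endings):
    """Strip the first matching ending; return (stem, whether anything was stripped)."""
    for ending in endings:
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM:
            return word[: -len(ending)], True
    return word, False


def one_pass(word):
    """Remove at most one ending: a -het form if present, else an inflectional one."""
    stem, stripped = strip_one(word, DERIVATIONAL)
    if stripped:
        return stem
    return strip_one(word, INFLECTIONAL)[0]


def two_pass(word):
    """Remove a -het form first, then run the inflectional rules on the result."""
    stem, _ = strip_one(word, DERIVATIONAL)
    return strip_one(stem, INFLECTIONAL)[0]


for w in ("forlegen", "forlegenhet", "forlegenhetens", "firkantet", "firkantetheten"):
    print(f"{w:16}{one_pass(w):12}{two_pass(w)}")
```

With one pass the -het forms stop at forlegen/firkantet, while the base words go on to forleg/firkant; the second pass is what brings both onto the same stem.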

But I think maybe the rules -dommer and -dommen should be removed, because the 
words dommer (judge) and dommen (the sentence) are both common words that also 
occur as the last part of compounds. So the word linjedommer (linesman) would 
be stemmed to linje (line), which is too aggressive.
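
A small Python sketch of the effect of dropping those two rules. The rule lists, the `strip_suffix` helper and the minimum stem length are assumptions made for illustration, and barndom (childhood) / barn (child) is an example word added here, not one taken from the patch or its tests.

```python
def strip_suffix(word, rules, min_stem=3):
    """Strip the longest matching suffix from `rules`, if enough of the word remains."""
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Hypothetical rule sets: with and without the -dommer/-dommen rules.
WITH_DOMMER = ("dommer", "dommen", "dom")
WITHOUT_DOMMER = ("dom",)

for word in ("linjedommer", "barndom", "barndommen"):
    print(word, strip_suffix(word, WITH_DOMMER), strip_suffix(word, WITHOUT_DOMMER))
```

Without the two rules linjedommer is left alone, at the cost that a definite form such as barndommen is no longer reduced by this step either.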

I see that it soon gets complicated to try to be clever. Should we go back to 
the one-pass again for the light stemmer? Christian?


[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer

2012-02-02 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13199235#comment-13199235
 ] 

Robert Muir commented on SOLR-2764:
---

Jan, I wasn't trying to be critical about these endings, because of course some 
of the existing light stemmers have a few _selected_ derivational endings that 
are taken care of. And that's really what it's all about. When we are talking 
about something like adjective-noun, I didn't mean to say we shouldn't do it, 
because it sounds quite reasonable, but we should explore the options.

For example, as an alternative to multi-pass, a 'less elegant to some' but 
really practical way to go about it can be to 'multiply through' and convert 
the possibilities to single-pass.

E.g. the typical 'undrinkables' hunspell example: if I have the English 
inflectional plural ending -s and the derivational ending -able, instead of:
* pass 1: remove inflectional endings (e.g. -s)
* pass 2: remove derivational endings (e.g. -able)

we just take all the pass 2 endings that are compatible with pass 1 endings and 
cross-multiply, to make a single-pass algorithm. Some won't be compatible (so 
we won't combine -able + -s into -ables).
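
A rough Python sketch of the cross-multiplication idea, using Norwegian endings from this issue rather than the English ones. Everything here is an assumption made for illustration: the ending lists are a tiny subset, the `COMPATIBLE` whitelist is hypothetical and makes no linguistic claim, and `build_single_pass` and `stem` are made-up names.

```python
from itertools import product

INFLECTIONAL = ("ens", "en", "et")  # what pass 1 would remove
DERIVATIONAL = ("het",)             # what pass 2 would remove

# Hypothetical whitelist of (derivational, inflectional) pairs that may combine.
COMPATIBLE = {("het", "en"), ("het", "ens")}


def build_single_pass(inflectional, derivational, compatible):
    """Cross-multiply compatible endings into one table, longest ending first."""
    combined = {
        d + i
        for d, i in product(derivational, inflectional)
        if (d, i) in compatible
    }
    endings = combined | set(inflectional) | set(derivational)
    return sorted(endings, key=lambda s: (-len(s), s))


def stem(word, endings, min_stem=3):
    """Single pass: strip the first (longest) matching ending, at most once."""
    for ending in endings:
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)]
    return word


ENDINGS = build_single_pass(INFLECTIONAL, DERIVATIONAL, COMPATIBLE)
print(ENDINGS)
print(stem("forlegenhetens", ENDINGS), stem("firkantet", ENDINGS))
```

The table is built once up front, so the stemmer itself stays a single longest-match lookup, and pairs left out of the whitelist never produce a combined ending.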

I'm not sure if this is helpful for the Norwegian case as I'm not as familiar 
with it, just an idea.



[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer

2012-01-29 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13195863#comment-13195863
 ] 

Robert Muir commented on SOLR-2764:
---

Just some general suggestions:

In a light stemmer, I would be wary of derivational endings. 
It seems in the case of dom/het, because it's dealing with adj/noun, that it's
on the edge (maybe OK here), but if possible it would be better to
avoid multiple passes... this is the kind of thing that causes snowball 
problems.

Can you think of examples for dom/het where the meaning would be changed?

For example: freedom is used the same way in English, but stemming it 
to free is very lossy, since free has a variety of meanings (such as 'costs 
nothing'), some of which are incompatible with freedom. This is the danger of 
stripping derivational suffixes...



[jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer

2012-01-27 Thread Christian Moen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13195451#comment-13195451
 ] 

Christian Moen commented on SOLR-2764:
--

I added a few entries to the tests, including some irregular ones, to validate 
and illustrate how the stemmer works in these cases.  Jan, looks good to me.  +1
