Snowball is a per-language stemmer, which is why you will see one class per language, such as DutchStemmer.cs, FinnishStemmer.cs, German2Stemmer.cs, ItalianStemmer.cs, etc.
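
To illustrate the one-class-per-language layout, here is a rough sketch of picking a stemmer from a detected language code. Everything in it is an assumption for illustration: the Stemmer interface, the SuffixStemmer placeholder, and the StemmerRegistry class are hypothetical and are not the Snowball or Lucene.Net API. The placeholder only strips a fixed suffix and is not the real Snowball algorithm.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

public class StemmerRegistry {

    /** Hypothetical minimal interface; the real Snowball classes expose a different API. */
    public interface Stemmer {
        String stem(String word);
    }

    /** Placeholder that strips one fixed suffix. NOT the real Snowball algorithm. */
    static final class SuffixStemmer implements Stemmer {
        private final String suffix;

        SuffixStemmer(String suffix) {
            this.suffix = suffix;
        }

        @Override
        public String stem(String word) {
            // Only strip when a stem of at least three characters remains.
            boolean strip = word.endsWith(suffix) && word.length() > suffix.length() + 2;
            return strip ? word.substring(0, word.length() - suffix.length()) : word;
        }
    }

    // One entry per language, mirroring the one-class-per-language layout.
    private static final Map<String, Supplier<Stemmer>> BY_LANGUAGE = Map.of(
        "nl", () -> new SuffixStemmer("en"),
        "de", () -> new SuffixStemmer("en"),
        "it", () -> new SuffixStemmer("i"));

    /** Returns a stemmer for the language code, or empty if the language is unsupported. */
    public static Optional<Stemmer> forLanguage(String code) {
        Supplier<Stemmer> factory = BY_LANGUAGE.get(code.toLowerCase(Locale.ROOT));
        return factory == null ? Optional.empty() : Optional.of(factory.get());
    }

    public static void main(String[] args) {
        String detected = "nl"; // assumed to come from a separate language-detection step
        String word = "boeken";
        // Fall back to the unstemmed word when no stemmer exists for the language.
        String stemmed = forLanguage(detected).map(s -> s.stem(word)).orElse(word);
        System.out.println(stemmed);
    }
}
```

The lookup returns an empty Optional for an unknown language, so the caller decides whether to skip stemming or fall back to a default, instead of silently stemming with the wrong language.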

-- George

> -----Original Message-----
> From: Torsten Rendelmann [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 03, 2007 2:39 AM
> To: [email protected]
> Subject: RE: Best use of language dep. analyzers?
>
> George,
>
> Yes, Snowball was on my mind as I wrote my post.
> My understanding was that it provides a general way to analyze,
> rather than one analyzer for each language.
> Am I wrong?
>
> If only I had enough spare time to have a look, I would like to
> help with that (porting our current code, which uses per-language
> analyzers, and tracking down issues).
>
> Torsten
>
> > -----Original Message-----
> > From: George Aroush [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, April 03, 2007 2:40 AM
> > To: [email protected]
> > Subject: RE: Best use of language dep. analyzers?
> >
> > Hi Torsten,
> >
> > Are you referring to the analyzers in Snowball.Net? I ported those
> > analyzers to C#; however, since I lack the language understanding,
> > and those analyzers don't come with JUnit tests to port and run in
> > C# land, I can't confirm whether the port is valid. This is the
> > case for 1.9 as well as for 2.0, and I'm afraid it will remain the
> > case unless someone with language knowledge debugs them.
> >
> > -- George Aroush
> >
> > -----Original Message-----
> > From: Torsten Rendelmann [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, March 31, 2007 11:52 AM
> > To: [email protected]
> > Subject: Best use of language dep. analyzers?
> >
> > Hi, I'm not so familiar with the direction of Lucene (Java)
> > development in the field of language-dependent analyzers. What
> > will it be?
> >
> > We use a slightly modified version of Lucene.Net 1.9 (which
> > includes the already published/converted language-dependent
> > analyzers: the various folders below "Analysis" named "BR", "CJK",
> > "FR", "DE", etc.). As far as I understand, they should be used to
> > analyze language-specific documents/texts and get rid of stop
> > words, etc., so as to provide the "real" text to index. So
> > currently we detect the language of the documents we index, map
> > it to the "right" analyzer, and add the document.
> >
> > But they are not stable; we ran into various problems using them
> > (endless loops and an empty string in a stop word table, to name
> > just a few).
> >
> > Will this be the same for Lucene.Net 2.x? What "language" package
> > will be available?
> > Will it be part of the Apache project?
> >
> > Thx,
> > Torsten Rendelmann
