RE: Best use of language dep. analyzers?

George Aroush Tue, 03 Apr 2007 12:14:57 -0700

Snowball is a per-language stemmer, which is why you will see classes such as
DutchStemmer.cs, FinnishStemmer.cs, German2Stemmer.cs, ItalianStemmer.cs,
etc.
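
As a rough sketch of that one-class-per-language convention (written in Java,
the language the port came from; the class name StemmerNames, the method
stemmerClassFor, and the short language list are illustrative assumptions, not
part of Snowball or Lucene.Net), picking a stemmer by language could look
something like this:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class StemmerNames {
    // Hypothetical list: only the languages named in this message.
    private static final Set<String> KNOWN = new LinkedHashSet<>(
            Arrays.asList("Dutch", "Finnish", "German2", "Italian"));

    // Maps a language name to the stemmer class name that follows the
    // "<Language>Stemmer" convention; rejects languages not in the list.
    static String stemmerClassFor(String language) {
        if (!KNOWN.contains(language)) {
            throw new IllegalArgumentException(
                    "No stemmer known for: " + language);
        }
        return language + "Stemmer";
    }

    public static void main(String[] args) {
        System.out.println(stemmerClassFor("Dutch"));
    }
}
```

The caller still has to decide which language a document is in; the stemmer
itself does not detect it.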


-- George

> -----Original Message-----
> From: Torsten Rendelmann [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, April 03, 2007 2:39 AM
> To: [email protected]
> Subject: RE: Best use of language dep. analyzers?
> 
> George,
> 
> Yes, Snowball was what I had in mind when I wrote my post.
> My understanding was that it provides one general way to
> analyze, rather than one analyzer for each language.
> Am I wrong?
> 
> If only I had enough spare time to take a look, I
> would like to help with that (porting our current code, which uses
> per-language analyzers, and tracking down issues).
> 
> Torsten
> 
> > -----Original Message-----
> > From: George Aroush [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, April 03, 2007 2:40 AM
> > To: [email protected]
> > Subject: RE: Best use of language dep. analyzers?
> > 
> > Hi Torsten,
> > 
> > Are you referring to the analyzers in Snowball.Net?  I ported those
> > analyzers to C#. However, since I lack the language understanding, and
> > those analyzers don't come with JUnit tests to port and run in C#
> > land, I can't confirm whether the port is valid.  This is the case
> > for 1.9 as well as for 2.0, and I'm afraid it will remain the case
> > unless someone with language knowledge debugs them.
> > 
> > -- George Aroush
> > 
> > -----Original Message-----
> > From: Torsten Rendelmann [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, March 31, 2007 11:52 AM
> > To: [email protected]
> > Subject: Best use of language dep. analyzers?
> > 
> > Hi, I'm not so familiar with the lucene (Java) direction of dev. in 
> > the field of language dependent analyzers. What will it be?
> >  
> > We use a slightly modified version of Lucene.Net 1.9 (which includes the
> > already published/converted language-dependent analyzers: various folders
> > below "Analysis" named "BR", "CJK", "FR", "DE", etc.). As far as I
> > understand, they should be used to analyze language-specific
> > documents/texts and get rid of stop words, etc., so as to provide the
> > "real" text to index. So currently we detect the language of each
> > document we index, use it to create the "right" analyzer,
> > and add the document.
> > But they are not stable; we ran into various problems using them (endless
> > loops and an empty string in a stop word table, to name just two).
> >  
> > Will this be the same for Lucene.Net 2.x? What "language"
> > package will be available?
> > Will it be part of the Apache project?
> >  
> > Thx,
> > Torsten Rendelmann
> >  
> 
>
