Re: Analyzers and multiple languages

Erik Hatcher Fri, 13 Oct 2006 07:53:07 -0700


On Oct 13, 2006, at 3:42 AM, Antony Bowesman wrote:

I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the document is known. Has anyone looked at or is using language analysis to determine the language of a document in ISO-8859-1?

There is a language identifier plugin in the Nutch codebase that could surely be distilled (and there are plans to do so) into a standalone library:

<http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/languageidentifier/>
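To give a feel for what language identification involves, here is a minimal sketch in Python. It is not the Nutch plugin's method: it simply counts hits against tiny, hand-picked stopword lists, and the lists, the `guess_language` name, and the "unknown" fallback are all assumptions made for illustration. A real identifier would use far richer statistics and more text.

```python
import re

# Tiny, hand-picked stopword lists. These are illustrative assumptions,
# far too small for real documents.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "des", "un", "une"},
}


def guess_language(text, default="unknown"):
    """Return the language code whose stopwords occur most often in text."""
    # Split into runs of letters (digits, punctuation and underscores are dropped).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no stopword hits at all there is no evidence, so fall back.
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    for sample in (
        "The cat is in the garden and it is happy",
        "Der Hund ist nicht in dem Haus und die Katze ist zu klein",
        "Le chat est dans la maison et les enfants",
    ):
        print(guess_language(sample), "<-", sample)
```

The returned code could then be used to pick a language-specific analyzer, with a language-neutral one as the fallback when the guess is "unknown".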

What about stemming? I see Google now says it does stemming, but again here language detection seems to be a stumbling block in the way of choosing the right stemmer. Does stemming provide much of an index size reduction and is it actually useful in search?

Stemming shouldn't be considered for reducing index size, but rather to improve a user's experience in findability. It is quite useful in the right situations, but it is not something that all projects desire, so you'd have to see if it fits your needs specifically.
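As a rough illustration of the findability point, the sketch below builds two tiny inverted indexes in Python, one over raw tokens and one over stemmed tokens. The `toy_stem` function is a deliberately naive English suffix stripper invented for this example (it is not Porter or Snowball), and the sample documents are made up.

```python
from collections import defaultdict

# Suffixes tried in order, longest first. A toy rule set, English only.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def identity(word):
    """No normalization: index the token exactly as written."""
    return word


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Look up a single-term query, normalized the same way as the index."""
    return sorted(index.get(normalize(query.lower()), set()))


# Made-up sample documents.
DOCS = {
    1: "indexing documents in many languages",
    2: "the document was indexed yesterday",
    3: "search the index",
}

if __name__ == "__main__":
    plain = build_index(DOCS, identity)
    stemmed = build_index(DOCS, toy_stem)
    print("plain:  ", search(plain, "index", identity))
    print("stemmed:", search(stemmed, "index", toy_stem))
```

Without stemming, a search for "index" matches only the document containing that exact token; with stemming, "indexing" and "indexed" match as well. The same stemmer has to be applied at index time and at query time, which is why knowing the document's language matters.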


        Erik



