On Oct 13, 2006, at 3:42 AM, Antony Bowesman wrote:
I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the document is known. Has anyone looked at, or is anyone using, language analysis to determine the language of a document in ISO-8859-1?
There is a language identifier plugin in the Nutch codebase that could surely be distilled (and there are plans to do so) into a standalone library:
<http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/languageidentifier/>
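To give a feel for what such an identifier does, here is a minimal sketch of one common approach: rank the character trigrams of a document and compare that ranking against per-language profiles. This is not the Nutch plugin's code. The function names, the `SAMPLES` texts, and the `top` cutoff are all made up for illustration, and real profiles would need far more training text than one sentence per language.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Return the character n-grams of text, most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Break ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]

def out_of_place(doc_profile, lang_profile, penalty=300):
    """Sum of rank differences; n-grams unknown to the language cost `penalty`."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

# Hypothetical toy training data, ASCII only to keep the example portable.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog runs away from the fox",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und dann rennt der hund weg",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis le chien s'enfuit",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

def identify(text):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(PROFILES, key=lambda lang: out_of_place(doc, PROFILES[lang]))

print(identify("the dog and the fox"))
print(identify("der hund und der fuchs"))
```

Since the character set is already known in your case, the text can be decoded first and the identifier only has to choose among the languages that share that character set.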
What about stemming? I see Google now says it does stemming, but again here language detection seems to be a stumbling block in the way of choosing the right stemmer. Does stemming provide much of an index size reduction and is it actually useful in search?
Stemming shouldn't be considered a way to reduce index size, but rather a way to improve findability for users. It is quite useful in the right situations, but it is not something that all projects want, so you'd have to see whether it fits your needs specifically.
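As a rough illustration of the findability point, the sketch below indexes the same three documents with and without stemming and runs the same query against both. The `toy_stem` function is a crude suffix stripper invented for this example, not Porter or Snowball, and `build_index`, `search`, and `DOCS` are likewise hypothetical names.

```python
def toy_stem(word):
    """Crude English suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words like "was" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def identity(word):
    """No-op analyzer: index terms exactly as they appear."""
    return word

def build_index(docs, analyze):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(analyze(token), set()).add(doc_id)
    return index

def search(index, query, analyze):
    """Look up a single-term query using the same analyzer as the index."""
    return sorted(index.get(analyze(query.lower()), set()))

DOCS = {
    1: "indexing documents",
    2: "the document was indexed",
    3: "search results",
}

plain = build_index(DOCS, identity)
stemmed = build_index(DOCS, toy_stem)

print(search(plain, "index", identity))    # no exact-term match
print(search(stemmed, "index", toy_stem))  # matches "indexing" and "indexed"
```

The stemmed index does end up with slightly fewer distinct terms, but the difference that matters is that a query for "index" now finds the documents that say "indexing" and "indexed".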
Erik