This won't be *really* helpful, but I remember this being discussed at some length a while ago. You should find some good information if you search the list archive, probably for "language".

I didn't pay much attention, since this isn't something I've been concerned with lately, so I can't be of much real help...

Best
Erick

On 10/13/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:
Hello,

I'm new to Lucene and wanted some advice on analyzers, stemmers and language analysis. I've got LIA, so have read its chapters.

I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the document is known. Has anyone looked at, or is anyone using, language analysis to determine the language of a document in ISO-8859-1? Is it worth doing, or does StandardAnalyzer cope well with most European languages as long as it is provided with a suitable multilingual set of stop words?

What about stemming? I see Google now says it does stemming, but here again language detection seems to be a stumbling block in choosing the right stemmer. Does stemming provide much of an index size reduction, and is it actually useful in search?

Antony
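
For anyone finding this thread in the archive: one simple way to guess the language of an ISO-8859-1 document, so that a matching stemmer can be picked, is to count how many tokens fall into each language's stop-word list. The sketch below is only an illustration of that idea and is not part of Lucene; the function name `guess_language`, the `STOP_WORDS` table and its tiny word lists are all assumptions made up for the example, and a real system would need much larger lists and some handling of short or mixed-language text.

```python
import re

# Hypothetical, deliberately tiny stop-word lists (illustration only).
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "les", "des", "est", "un", "une"},
}


def guess_language(text, stop_words=STOP_WORDS):
    """Return the code of the language whose stop words occur most often.

    Returns None when no stop word from any list is found. Ties go to
    whichever language comes first in the table.
    """
    # Split into lowercase alphabetic tokens (accented letters included).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in stop_words.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("Der Hund ist nicht in dem Haus und die Katze ist zu klein"))
```

The guessed code could then be used to choose a per-language analyzer or stemmer, falling back to a language-neutral analyzer when the result is `None`.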