Hello,
I'm new to Lucene and wanted some advice on analyzers, stemmers and language
analysis. I've got LIA and have read its chapters on these topics.
I am writing a framework that needs to be able to index documents from a range
of languages where only the character set of the document is known. Has anyone
looked at, or is anyone using, language analysis to determine the language of a
document encoded in ISO-8859-1?
Is it worth doing, or does StandardAnalyzer cope well with most European
languages as long as it is provided with a suitable multilingual set of stop words?
What about stemming? I see Google now says it does stemming, but here again
language detection seems to be a stumbling block to choosing the right
stemmer. Does stemming reduce index size by much, and is it actually useful
in search?
Antony