Hello Antony,
I have a similar problem. My collection contains mainly German
documents, but some in English and few in French, Spain and Latin. I
know that each language has its own stemming rules.
Language detection is not my domain. But I can imagine it could be
possible to detect the language of a document by statistics methods like
character based n-grams. "Ä", "ö", "ü", "ß" are quite often used in
German words, “th” could indicate English and so on. It is probably more
complex. Matching stop words of a language in a document could be
another or additional way. How ever, let’s say I can detect the language
of a document. Than I would use an analyzer or stemmer in the language
of the document.
Now I see two other problems. Quite often you will find mainly English
terms in non English documents. You will use for these terms the wrong
analyzer. Another problem is the query. You should use the same analyzer
for indexing the documents and parsing the queries. The query is usually
to short for statistical methods, and you will find stop words in a
query not so often.
So I decide for my task to use one analyzer for all documents and the
queries. I use the stemmer of the most probably language of my
documents. That is not perfect but should be OK.
Sören
Antony Bowesman wrote:
Hello,
I'm new to Lucene and wanted some advice on analyzers, stemmers and
language analysis. I've got LIA, so have read it's chapters.
I am writing a framework that needs to be able to index documents from a
range of languages where just the character set of the document is
known. Has anyone looked at or is using language analysis to determine
the language of a document in ISO-8859-1.
Is it worth doing or does StandardAnalyzer cope well with most European
languages as long as it is provided with a suitable multi-lingual set of
stop words.
What about stemming? I see Google now says it does stemming, but again
here language detection seems to be a stumbling block in the way of
choosing the right stemmer. Does stemming provide much of an index size
reduction and is it actually useful in search?
Antony
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]