Re: Analyzers and multiple languages

Soeren Pekrul Fri, 13 Oct 2006 06:15:20 -0700

Hello Antony,

I have a similar problem. My collection contains mainly German documents, but also some in English and a few in French, Spanish and Latin. I know that each language has its own stemming rules.

Language detection is not my domain, but I can imagine it is possible to detect the language of a document with statistical methods such as character-based n-grams. "Ä", "ö", "ü" and "ß" are quite common in German words, "th" could indicate English, and so on. It is probably more complex than that. Matching the stop words of each language against the document could be an alternative or additional approach. However, let's say I can detect the language of a document. Then I would use the analyzer or stemmer for that language.
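The stop word idea could be sketched roughly as follows. This is only an illustration in Python, not Lucene code: the function name guess_language and the tiny stop word lists are assumptions made up for the example, and a real detector would need much longer lists and probably n-gram statistics as well.

```python
import re

# Hypothetical, deliberately tiny stop word lists (assumption for this sketch).
STOP_WORDS = {
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "von"},
    "english": {"the", "and", "is", "not", "a", "of", "with", "to", "in", "that"},
    "french": {"le", "la", "les", "et", "est", "pas", "un", "une", "avec", "de"},
}


def guess_language(text):
    """Return the language whose stop words occur most often, or None if none match."""
    tokens = re.findall(r"\w+", text.lower())
    # Count, per language, how many tokens are stop words of that language.
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOP_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With these lists, guess_language("Der Hund ist nicht mit der Katze") picks "german", while a short input without any stop words, such as "lucene analyzer", gives None.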

Now I see two other problems. Quite often you will find foreign terms, mainly English ones, in non-English documents, and for these terms you will use the wrong analyzer. The other problem is the query. You should use the same analyzer for indexing the documents and for parsing the queries, but a query is usually too short for statistical methods, and it rarely contains stop words.

So for my task I decided to use one analyzer for all documents and queries. I use the stemmer for the most probable language of my documents. That is not perfect, but it should be OK.
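To show why the same analysis has to run on both sides, here is a minimal sketch in Python. It is a toy and not the Lucene API: analyze, build_index and search are hypothetical names, and the suffix stripping is a crude stand-in for a real German stemmer.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase, tokenize, strip one of a few suffixes (not a real stemmer)."""
    terms = []
    for token in re.findall(r"\w+", text.lower()):
        for suffix in ("en", "er", "e", "s"):
            # Only strip if at least three characters remain.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query, analyzed the same way."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Because documents and queries pass through the same analyze function, a query for "documents" finds a document indexed under the stripped term "document". If the query were analyzed differently, the terms would no longer line up with the index.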


Sören

Antony Bowesman wrote:

Hello,
I'm new to Lucene and wanted some advice on analyzers, stemmers and language analysis. I've got LIA, so have read its chapters.
I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the document is known. Has anyone looked at, or is anyone using, language analysis to determine the language of a document in ISO-8859-1?
Is it worth doing, or does StandardAnalyzer cope well with most European languages as long as it is provided with a suitable multi-lingual set of stop words?
What about stemming? I see Google now says it does stemming, but again here language detection seems to be a stumbling block in the way of choosing the right stemmer. Does stemming provide much of an index size reduction and is it actually useful in search?
Antony


