Hi, I plan to use Lucene to index documents in multiple languages (i.e. the collection spans several European languages, each document being in a single one) as follows.
Index:
- Before indexing, detect the language of the document (using Nutch's Language Identifier).
- Index the document with the Analyzer for that language. The Analyzer will be constructed with stopwords for that language. Stemming will NOT be used for any language.
- All documents go into one single index.
- Remember all the languages encountered while creating the index.

Search:
- Build the superset of stopwords by merging the stopword lists of all the languages encountered.
- Create an Analyzer with this merged stopword list.
- Use this analyzer for all search queries.

I have read that one should use the same analyzer during search as the one used to create the index. I am clearly deviating from this rule. But since I am not using any language-specific filter (no stemming), this looks correct to me. (If the need arises in future to restrict results to a particular language, I plan to add a language field to each document and use it in the query.)

* While getting the details right, am I falling for a grand fallacy? Is there any basic assumption in my thinking which is patently wrong?

* Curious question: support for CJK. Since StandardAnalyzer is good enough for the major European languages, I can use a different index for CJK built with a CJK analyzer, or potentially a different one for each of C, J and K. To keep things simple, let's say only one of these indices will be searched at a time (so as to avoid the complications of merging results from multiple indices). Is this solution correct?

Thanks in advance.

--shashi
--
"Speed is subsittute fo accurancy."
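For concreteness, the search-time stopword-superset step described above can be sketched in plain Java. The stopword lists here are tiny hypothetical samples, not real analyzer resources; the merged set would then be handed to an analyzer constructor that accepts a stopword set (e.g. StandardAnalyzer, whose exact constructor signature varies across Lucene versions).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordMerge {

    // Merge the per-language stopword lists into one superset,
    // deduplicating words shared across languages.
    static Set<String> mergeStopwords(List<Set<String>> perLanguage) {
        Set<String> merged = new HashSet<>();
        for (Set<String> stops : perLanguage) {
            merged.addAll(stops);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical sample lists; "in" is a stopword in both languages,
        // so it appears only once in the superset.
        Set<String> english = new HashSet<>(Arrays.asList("the", "and", "in"));
        Set<String> german  = new HashSet<>(Arrays.asList("der", "und", "in"));

        Set<String> merged = mergeStopwords(Arrays.asList(english, german));
        System.out.println(merged.size());
    }
}
```

Running the sketch prints 5, since the overlapping word is kept once. Note the trade-off this scheme implies: a query term that is a stopword in any indexed language is dropped for all languages, which is usually acceptable without stemming but worth keeping in mind.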