Hi, I plan to use Lucene to index documents in multiple languages (i.e. the collection spans several European languages, each document being in a single one) as follows.
Index:
- Before indexing, detect the language of the document (using Nutch's Language Identifier).
- Index the document with the Analyzer for that language. The Analyzer will be constructed with stopwords for that language. Stemming will NOT be used for any language.
- All documents go into one single index.
- Remember all the languages encountered while creating the index.

Search:
- Build the superset of stopwords by merging the stopword lists of all the languages encountered.
- Create an Analyzer with this merged stopword list.
- Use this analyzer for all search queries.

I have read that one should use the same analyzer during search as the one used to create the index. I am clearly deviating from this rule. But since I am not using any language-specific filter (no stemming), this looks correct to me. (If the need arises in future to restrict results to a particular language, I plan to add a language field to each document and use it in the query.)

* While getting the details right, am I falling for a grand fallacy? Is there any basic assumption in my thinking which is patently wrong?

* Curious question: support for CJK. Since StandardAnalyzer is good enough for the major European languages, I can use a different index for CJK built with a CJK analyzer, or potentially a different one for each of C, J and K. To keep things simple, let's say only one of these indices will be searched at a time (so as to avoid the complications of merging results from multiple indices). Is this solution correct?

Thanks in advance.

--shashi
--
"Speed is subsittute fo accurancy."
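For concreteness, the search-time stopword-superset step described above can be sketched in plain Java. The stopword lists here are tiny hypothetical samples, not real analyzer resources; the merged set would then be handed to an analyzer constructor that accepts a stopword set (e.g. StandardAnalyzer, whose exact constructor signature varies across Lucene versions).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordMerge {

    // Merge the per-language stopword lists into one superset,
    // deduplicating words shared across languages.
    static Set<String> mergeStopwords(List<Set<String>> perLanguage) {
        Set<String> merged = new HashSet<>();
        for (Set<String> stops : perLanguage) {
            merged.addAll(stops);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical sample lists; "in" is a stopword in both languages,
        // so it appears only once in the superset.
        Set<String> english = new HashSet<>(Arrays.asList("the", "and", "in"));
        Set<String> german  = new HashSet<>(Arrays.asList("der", "und", "in"));

        Set<String> merged = mergeStopwords(Arrays.asList(english, german));
        System.out.println(merged.size());
    }
}
```

Running the sketch prints 5, since the overlapping word is kept once. Note the trade-off this scheme implies: a query term that is a stopword in any indexed language is dropped for all languages, which is usually acceptable without stemming but worth keeping in mind.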