Hi everyone! I'm about to build a search engine that will handle documents in several languages (4 for now but the number will increase in the near future). In order to index them properly and offer the best user experience, I'm automatically recognizing the language of each document in order to be able to use appropriate language-dependent analysis, so that e.g. "the" is recognized as a stopword in an English document, but is interpreted as potentially meaning "thé" in a French document. I also want stemming rules to be applied depending on the document's language, so that a search for "bank" matches "Banken" in a German document and "banks" in an English document -- and not the other way round.
Now I've thought of two possible architectures to achieve this, perhaps some experienced Lucene user might give me some advice on which approach would be better -- or is there yet another one which would be even more appropriate? The first method I'm thinking of, and I've already partially implemented, is to have several indices -- one per language. Upon recognizing the language of a document, a flag is set in my internal data structure, and this is used by the indexing module to decide which index the document should be added to. There is an additional index for documents where no language could be recognized -- in that case, a simple tokenizer is used. As I see it, one advantage of this approach is that all indices share the same structure, which makes it easier to build queries. Also, they can all be searched parallely, but I'm not sure that this is a great advantage. However, one drawback might be that searching several smaller indices might not be as efficient as searching one big index containing all documents. Besides, I'm planning on improving the processing to allow a single document to be assigned several languages (e.g. paragraph-based recognition), and this architecture would mean that, in order to properly analyze the parts, a multilingual document would have to be split between several indices, therefore either repeating common information (like e.g. document name) or having yet another index containing language-independent information. This might quickly become rather cumbersome to manage. The second method I've thought of is to have all languages in the same index and use different analyzers on fields that require analysis. In order to do that, I was thinking of extending the names of the fields with the names of the languages -- like e.g. "content-en" vs "content-fr" vs "content-xx" (for "no language recognized"). Then using a customized analyzer, the name of the field would be parsed in method tokenStream and the proper language-dependent analyzer would be selected. The drawback of this method, as I see it, is that the number of fields in the index increases drastically, which in turn means that building queries becomes rather cumbersome -- but still doable, assuming (which also is the case) that I know the exact list of languages I'm dealing with. Also, it means that Lucene would be searching in non-existing fields in most documents, as I doubt many of them would contain *all* languages. But it keeps the complete information about one document gathered in one place and requires searching only one index. As I said, I've already implemented the first method some time ago and it works fine. I've only just thought about the second one when I read about this PerFieldAnalyzerWrapper, which allows to do just what I want in the second method. Since my index won't be that big at first, I doubt either architecture would prove to be much more efficient than the other, however I want to use a scaleable design right from the start, so I was wondering whether some Lucene gurus might give me some insights as to what in their eyes would be the better approach -- or whether there might be a different, much better technique I haven't thought of. Thanks a lot in advance for your support and ideas! David --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org