Designing a multilingual index

David Vergnaud Wed, 31 Mar 2010 09:20:57 -0700

Hi everyone!

I'm about to build a search engine that will handle documents in several 
languages (4 for now, but the number will increase in the near future). To 
index them properly and offer the best user experience, I automatically detect 
the language of each document so that I can apply the appropriate 
language-dependent analysis: e.g. "the" is recognized as a stopword in an 
English document, but is interpreted as potentially meaning "thé" in a French 
document. I also want stemming rules to be applied depending 
on the document's language, so that a search for "bank" matches "Banken" in a 
German document and "banks" in an English document -- and not the other way 
round.


I've thought of two possible architectures to achieve this. Perhaps some 
experienced Lucene users could give me some advice on which approach would be 
better -- or is there yet another one which would be even more appropriate? 

The first method I'm thinking of, and I've already partially implemented, is to 
have several indices -- one per language. Upon recognizing the language of a 
document, a flag is set in my internal data structure, and this is used by the 
indexing module to decide which index the document should be added to. There is 
an additional index for documents where no language could be recognized -- in 
that case, a simple tokenizer is used. 
As I see it, one advantage of this approach is that all indices share the same 
structure, which makes it easier to build queries. Also, they can all be 
searched in parallel, but I'm not sure that this is a great advantage. One 
drawback, however, might be that searching several smaller indices is not as 
efficient as searching one big index containing all documents. Besides, I'm 
planning on improving the processing to allow a single document to be assigned 
several languages (e.g. paragraph-based recognition), and this architecture 
would mean that, in order to properly analyze the parts, a multilingual 
document would have to be split between several indices, therefore either 
repeating common information (e.g. the document name) or having yet another 
index containing language-independent information. This might quickly become 
rather cumbersome to manage. 
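
To make the first method concrete, here is a minimal sketch in plain Java (no 
Lucene classes involved). The class name IndexRouter, the language codes and 
the "index-xx" naming are placeholders made up for illustration, not my actual 
implementation. It maps a detected language to an index name and shows how the 
paragraphs of a multilingual document would end up spread over several indices.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: decides which per-language index a piece of text goes to.
// Language codes and index names are illustrative assumptions.
final class IndexRouter {
    // Catch-all index for text whose language could not be recognized.
    static final String FALLBACK_INDEX = "index-xx";

    private static final Map<String, String> INDEX_BY_LANGUAGE = Map.of(
        "en", "index-en",
        "fr", "index-fr",
        "de", "index-de");

    // Returns the index name for a detected language code (null = not recognized).
    static String indexFor(String detectedLanguage) {
        if (detectedLanguage == null) {
            return FALLBACK_INDEX;
        }
        String key = detectedLanguage.toLowerCase(Locale.ROOT);
        return INDEX_BY_LANGUAGE.getOrDefault(key, FALLBACK_INDEX);
    }

    // Groups (language, text) paragraphs of ONE document by target index.
    // More than one key in the result means the document is split across indices.
    static Map<String, List<String>> partition(List<Map.Entry<String, String>> paragraphs) {
        Map<String, List<String>> parts = new LinkedHashMap<>();
        for (Map.Entry<String, String> paragraph : paragraphs) {
            parts.computeIfAbsent(indexFor(paragraph.getKey()), k -> new ArrayList<>())
                 .add(paragraph.getValue());
        }
        return parts;
    }
}
```

Every key returned by partition would need its own copy of the common 
information (such as the document name), which is exactly the duplication I am 
worried about.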

The second method I've thought of is to have all languages in the same index 
and use different analyzers on fields that require analysis. In order to do 
that, I was thinking of extending the names of the fields with the names of the 
languages -- e.g. "content-en" vs "content-fr" vs "content-xx" (for "no 
language recognized"). A customized analyzer would then parse the name of the 
field in its tokenStream method and select the proper language-dependent 
analyzer. 
The drawback of this method, as I see it, is that the number of fields in the 
index increases drastically, which in turn means that building queries becomes 
rather cumbersome -- but still doable, assuming (as is the case) that I know 
the exact list of languages I'm dealing with. Also, it means that Lucene would 
be searching fields that don't exist in most documents, as I doubt many of 
them would contain *all* languages. But it keeps the complete information about 
one document gathered in one place and requires searching only one index. 
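
A minimal sketch of the second method, again in plain Java so that it runs 
without Lucene: the class name LanguageFields, the language list and the tiny 
stopword sets are placeholders made up for illustration. The stopword sets only 
stand in for real language-dependent analyzers. The sketch picks the analysis 
from the field-name suffix and expands a base field name into one field per 
language for query building.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a field-name-driven analyzer; not Lucene API.
final class LanguageFields {
    static final String UNKNOWN = "xx";
    // Assumed fixed list of languages, including the "not recognized" bucket.
    static final List<String> LANGUAGES = List.of("en", "fr", "de", UNKNOWN);

    // Toy stopword sets standing in for full per-language analyzers.
    private static final Map<String, Set<String>> STOPWORDS = Map.of(
        "en", Set.of("the", "a", "of"),
        "fr", Set.of("le", "la", "de"),
        "de", Set.of("der", "die", "das"),
        UNKNOWN, Set.<String>of());

    // "content-fr" -> "fr"; a missing or unknown suffix falls back to "xx".
    static String languageOf(String fieldName) {
        int dash = fieldName.lastIndexOf('-');
        if (dash < 0) {
            return UNKNOWN;
        }
        String suffix = fieldName.substring(dash + 1);
        return STOPWORDS.containsKey(suffix) ? suffix : UNKNOWN;
    }

    // Lowercases, splits on non-letter/non-digit runs, and drops the
    // stopwords of the language encoded in the field name.
    static List<String> analyze(String fieldName, String text) {
        Set<String> stopwords = STOPWORDS.get(languageOf(fieldName));
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !stopwords.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // "content" -> [content-en, content-fr, content-de, content-xx]
    static List<String> expand(String baseField) {
        List<String> fields = new ArrayList<>();
        for (String language : LANGUAGES) {
            fields.add(baseField + "-" + language);
        }
        return fields;
    }
}
```

With this, analyze("content-en", "The bank") drops "the" while 
analyze("content-fr", "The bank") keeps it, and a query on "content" has to be 
rewritten against every field returned by expand("content").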

As I said, I've already implemented the first method some time ago and it works 
fine. I only thought of the second one when I read about 
PerFieldAnalyzerWrapper, which allows me to do just what I want in the second 
method. Since my index won't be that big at first, I doubt either architecture 
would prove to be much more efficient than the other. However, I want to use a 
scalable design right from the start, so I was wondering whether some Lucene 
gurus might give me some insights as to what in their eyes would be the better 
approach -- or whether there might be a different, much better technique I 
haven't thought of. 

Thanks a lot in advance for your support and ideas!

David



