Tansley, Robert wrote:

Hi all,

DSpace (www.dspace.org) currently uses Lucene to index metadata
(Dublin Core standard) and extracted full-text content of documents
stored in it.  Now that the system is being used globally, it needs to
support multi-language indexing.

I've looked through the mailing list archives etc. and it seems it's
easy to plug in analyzers for different languages.

What if we're trying to index multiple languages in the same site?  Is
it best to have:

1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language?

I don't fully understand the consequences in terms of performance for
1/, but I can see that false hits could turn up where one word appears
in different languages (stemming could increase the chances of this).
Also some languages' analyzers are quite dramatically different (e.g.
the Chinese one which just treats every character as a separate
token/word).
On the other hand, if people are searching for proper nouns in metadata
(e.g. "DSpace") it may be advantageous to search all languages at once.


I'm also not sure of the storage and performance consequences of 2/.

Approach 3/ seems like it might be the most complex from an
implementation/code point of view.
But this will be the most robust solution. You have to differentiate between languages anyway and, as you pointed out, you can do so either by adding a Keyword field for the language or by creating separate
indexes.
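
To make the difference between 1/ and 2/ concrete, here is a minimal sketch in plain Java. It is not Lucene code: the TaggedIndex class and its add/search methods are hypothetical names invented for illustration, and the "index" is just an in-memory map. It only shows how a language tag stored with each document lets a search be constrained, and how an unconstrained search can return a false hit for a word such as "die" that exists in both English and German.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, library-independent model of options 1/ and 2/; not Lucene code.
class TaggedIndex {
    // One posting: a document id plus the language the document was tagged with.
    private static final class Posting {
        final int docId;
        final String lang;
        Posting(int docId, String lang) { this.docId = docId; this.lang = lang; }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Index every whitespace-separated, lower-cased token of the text.
    void add(int docId, String lang, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, k -> new ArrayList<>())
                    .add(new Posting(docId, lang));
        }
    }

    // lang == null searches all languages (option 1/);
    // a language code constrains the hits (option 2/).
    List<Integer> search(String term, String lang) {
        List<Integer> hits = new ArrayList<>();
        List<Posting> found = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Posting>emptyList());
        for (Posting p : found) {
            boolean langMatches = (lang == null) || lang.equals(p.lang);
            if (langMatches && !hits.contains(p.docId)) {
                hits.add(p.docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TaggedIndex index = new TaggedIndex();
        index.add(1, "en", "the batteries die quickly");
        index.add(2, "de", "die Bibliothek nutzt DSpace");
        index.add(3, "en", "DSpace stores documents");
        System.out.println(index.search("die", null));    // all languages
        System.out.println(index.search("die", "de"));    // German only
        System.out.println(index.search("dspace", null)); // proper noun, all languages
    }
}
```

With separate indexes (3/) the language filter disappears, because each index holds one language only, but a search for a proper noun such as "DSpace" then has to be run against every index and the results merged.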

If you need to run complex search strings over multiple fields and indexes, I recommend using the QueryParser to parse the search string. When you instantiate a QueryParser you need to provide an analyzer, which will be different
for each language.
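
The point about analyzers can be sketched as follows, again in plain Java rather than with the real Lucene classes. The LanguageAnalyzers class, its language codes, and both tokenizing functions are assumptions made for illustration; real analyzers also handle stop words, stemming and so on. The idea it shows is that the search string must be tokenized by the analyzer of the language being searched, the same one that was used at indexing time. The "zh" entry mimics the one-token-per-character behaviour mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-ins for per-language analyzers; not the Lucene Analyzer API.
class LanguageAnalyzers {
    private static final Map<String, Function<String, List<String>>> ANALYZERS = new HashMap<>();

    static {
        ANALYZERS.put("en", LanguageAnalyzers::wordTokens);
        ANALYZERS.put("zh", LanguageAnalyzers::characterTokens);
    }

    // Lower-case the text and split on anything that is not a letter or digit.
    static List<String> wordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Emit one token per non-whitespace character (code point).
    static List<String> characterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Pick the analyzer by language; use this for documents and for search strings alike.
    static List<String> analyze(String lang, String text) {
        Function<String, List<String>> analyzer = ANALYZERS.get(lang);
        if (analyzer == null) {
            throw new IllegalArgumentException("No analyzer for language: " + lang);
        }
        return analyzer.apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("en", "Digital Libraries, DSpace!"));
        // Three CJK characters written as Unicode escapes to avoid source-encoding issues.
        System.out.println(analyze("zh", "\u6570\u5b57 \u56fe").size());
    }
}
```

If the same search string were run through the wrong function, the tokens would not match what is stored in the index, which is why the analyzer has to be chosen per language whether you go with 2/ or 3/.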

I think the difference in performance between the 2nd and 3rd solutions won't be noticeable, but from a maintenance point of
view I would choose the third solution.

Of course, there are other factors that must be taken into account when designing such an application: number of documents to be indexed, number of document fields, index change frequency, server load (number of concurrent sessions), etc.

Hope these hints help you a little,

Best,

Sergiu



Does anyone have any thoughts or recommendations on this?

Many thanks,

Robert Tansley / Digital Media Systems Programme / HP Labs
 http://www.hpl.hp.com/personal/Robert_Tansley/

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


