Robert,

I'm very likely going to be using DSpace and some related technologies from the SIMILE project very soon :)


On May 31, 2005, at 5:08 PM, Tansley, Robert wrote:
Hi all,

DSpace (www.dspace.org) currently uses Lucene to index metadata
(Dublin Core standard) and extracted full-text content of documents
stored in it.  Now that the system is being used globally, it needs to
support multi-language indexing.

I've looked through the mailing list archives etc. and it seems it's
easy to plug in analyzers for different languages.

What if we're trying to index multiple languages in the same site?  Is
it best to have:

1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language?

I would vote for option #2 as it gives the most flexibility - you can query with or without concern for language.

I'm also not sure of the storage and performance consequences of 2/.

Adding an additional field will be of little consequence.
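
To make option #2 concrete, here is a rough sketch of the idea. It is a toy in-memory inverted index in plain Python, not the Lucene API; the class name ToyIndex, its methods, and the sample documents are all made up for illustration. The point is only that the language is one more value stored per document, and a search can either use it or ignore it.

```python
from collections import defaultdict


class ToyIndex:
    """Toy in-memory inverted index (hypothetical; a stand-in for one Lucene index)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.language = {}                # document id -> language code (the extra field)

    def add(self, doc_id, text, lang):
        # Record the language alongside the document, then index its terms.
        self.language[doc_id] = lang
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term, lang=None):
        # Without lang the search spans every language; with lang it is constrained.
        hits = self.postings.get(term.lower(), set())
        if lang is not None:
            hits = {d for d in hits if self.language[d] == lang}
        return sorted(hits)


index = ToyIndex()
index.add(1, "digital library software", "en")
index.add(2, "software de biblioteca digital", "es")
index.add(3, "logiciel de bibliothèque", "fr")

all_hits = index.search("digital")            # any language
spanish_hits = index.search("digital", "es")  # constrained to Spanish
```

In a real index the language constraint would be one more clause on the query rather than a post-filter, but the storage picture is the same: one small extra value per document.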

Approach 3/ seems like it might be the most complex from an
implementation/code point of view.

I don't think #3 is much more complex to implement than the other options, unless you want to search across all languages - and the MultiSearcher can handle that.
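
As a rough sketch of what #3 looks like in practice: one index per language, plus a helper that fans a query out to all of them and merges the hits. This is plain Python with toy in-memory indices, not the Lucene API; build_index, search_one, search_all, and the sample documents are hypothetical names used only for illustration. It also does no ranking, whereas a real cross-index search would merge results by score.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index (term -> set of doc ids) from {doc_id: text}."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return postings


# One separate index per language (sample data is invented).
indices = {
    "en": build_index({"en-1": "digital library software"}),
    "es": build_index({"es-1": "software de biblioteca digital"}),
}


def search_one(lang, term):
    # Search a single language's index.
    return sorted(indices[lang].get(term.lower(), set()))


def search_all(term):
    # Fan the query out to every index and merge the hits,
    # tagging each hit with the language of the index it came from.
    hits = []
    for lang in sorted(indices):
        hits.extend((lang, doc_id) for doc_id in search_one(lang, term))
    return hits
```

The single-language case is the simple one here; the extra work only shows up in the merge step when searching across everything.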

Does anyone have any thoughts or recommendations on this?

It's tough to give a general recommendation - it really depends on how each of these solutions fits into your architecture and what needs you have in terms of querying across multiple languages and such.

    Erik

