Robert,
I'm very likely going to be using DSpace and some related
technologies from the SIMILE project very soon :)
On May 31, 2005, at 5:08 PM, Tansley, Robert wrote:
Hi all,
DSpace (www.dspace.org) currently uses Lucene to index metadata
(Dublin Core standard) and extracted full-text content of documents
stored in it. Now that the system is being used globally, it needs to
support multi-language indexing.
I've looked through the mailing list archives etc. and it seems it's
easy to plug in analyzers for different languages.
What if we're trying to index multiple languages in the same site? Is
it best to have:
1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language?
I would vote for option #2 as it gives the most flexibility - you can
query with or without concern for language.
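To make option #2 concrete, here is a minimal sketch of the idea in plain Java. It is a toy in-memory model, not Lucene code: the class name LanguageFieldIndex, the "language" value stored with each document, and the whitespace tokenizing are all assumptions made for illustration. In a real Lucene index the language code would be an extra untokenized field on each Document, and the constraint would be one more required clause in the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model of option #2: one index, with a language code stored per document.
// Hypothetical names throughout; this does not use the Lucene API.
public class LanguageFieldIndex {

    // One indexed document: id, language code, and lower-cased text.
    private static final class Doc {
        final String id;
        final String language;
        final String text;

        Doc(String id, String language, String text) {
            this.id = id;
            this.language = language;
            this.text = text;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    public void add(String id, String language, String text) {
        docs.add(new Doc(id, language, text.toLowerCase(Locale.ROOT)));
    }

    // Pass language == null to search all languages,
    // or a code such as "en" to constrain the search.
    public List<String> search(String term, String language) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean languageOk = language == null || language.equals(d.language);
            boolean termFound = Arrays.asList(d.text.split("\\s+")).contains(needle);
            if (languageOk && termFound) {
                hits.add(d.id);
            }
        }
        return hits;
    }
}
```

The same search method serves both cases, which is the flexibility I mean: the caller decides per query whether language matters.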
I'm also not sure of the storage and performance consequences of 2/.
Adding an additional field will be of little consequence.
Approach 3/ seems like it might be the most complex from an
implementation/code point of view.
I don't think #3 is all that complex to implement beyond the other
options, except if you want to search across all languages - but the
MultiSearcher can handle that.
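For comparison, a rough sketch of option #3, again as a toy in plain Java rather than Lucene code. The class PerLanguageIndexes and its methods are hypothetical names; each inner map stands in for one per-language index, and searchAll stands in for what MultiSearcher does when it is given one searcher per index. Note the toy simply concatenates hits, whereas a real MultiSearcher merges results by score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of option #3: a separate index per language.
// Hypothetical names throughout; this does not use the Lucene API.
public class PerLanguageIndexes {

    // language code -> (document id -> lower-cased text)
    private final Map<String, Map<String, String>> indexes = new LinkedHashMap<>();

    public void add(String language, String id, String text) {
        indexes.computeIfAbsent(language, k -> new LinkedHashMap<>())
               .put(id, text.toLowerCase(Locale.ROOT));
    }

    // Search only the index for one language.
    public List<String> search(String language, String term) {
        List<String> hits = new ArrayList<>();
        Map<String, String> index = indexes.get(language);
        if (index == null) {
            return hits; // no index exists for this language
        }
        String needle = term.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (Arrays.asList(e.getValue().split("\\s+")).contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // Search every per-language index in turn and concatenate the hits.
    public List<String> searchAll(String term) {
        List<String> hits = new ArrayList<>();
        for (String language : indexes.keySet()) {
            hits.addAll(search(language, term));
        }
        return hits;
    }
}
```

The extra work over option #2 is mostly bookkeeping: routing each document to the right index at add time, and fanning out at search time.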
Does anyone have any thoughts or recommendations on this?
It's tough to give a general recommendation - it really depends on
how each of these solutions fits into the architecture and what needs
you have in terms of querying across multiple languages and such.
Erik