Re: Indexing multiple languages

Erik Hatcher Tue, 31 May 2005 16:13:45 -0700

Robert,

I'm very likely going to be using DSpace and some related technologies from the SIMILE project very soon :)



On May 31, 2005, at 5:08 PM, Tansley, Robert wrote:

Hi all,

The DSpace (www.dspace.org) currently uses Lucene to index metadata
(Dublin Core standard) and extracted full-text content of documents
stored in it.  Now the system is being used globally, it needs to
support multi-language indexing.

I've looked through the mailing list archives etc. and it seems it's
easy to plug in analyzers for different languages.

What if we're trying to index multiple languages in the same site?  Is
it best to have:

1/ one index for all languages

2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language

3/ separate indices for each language?

I would vote for option #2 as it gives the most flexibility - you can query with or without concern for language.

I'm also not sure of the storage and performance consequences of 2/.


Adding an additional field will be of little consequence.
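
To make option 2/ concrete, here is a minimal sketch of the idea in plain Java. It is a toy in-memory index, not the Lucene API: the class name LanguageFieldIndex, its methods, and the naive tokenizer are all hypothetical and chosen only for illustration. The point it tries to show is that storing one language value per document lets the same index answer both language-constrained and language-agnostic queries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy in-memory index (hypothetical, NOT the Lucene API): every document
// carries a language value next to its text, as in option 2/.
class LanguageFieldIndex {
    // document id -> language code, e.g. "en" or "fr"
    private final List<String> languages = new ArrayList<>();
    // term -> ids of the documents containing it, in insertion order
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Index one document and return its id. Tokenization is a naive
    // lowercase split on anything that is not a letter or digit.
    int add(String language, String text) {
        int id = languages.size();
        languages.add(language);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Avoid listing the same document twice for a repeated term.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != id) {
                ids.add(id);
            }
        }
        return id;
    }

    // Look up a term. A null language means "any language"; otherwise
    // only documents whose language value matches are returned.
    List<Integer> search(String term, String language) {
        List<Integer> hits = new ArrayList<>();
        List<Integer> ids = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptyList());
        for (int id : ids) {
            if (language == null || language.equals(languages.get(id))) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

In this sketch the language value costs one extra string per document, which is in line with the "little consequence" remark above. With a real Lucene index the equivalent would presumably be an untokenized field on each document plus an extra required clause in the query, but that mapping is an assumption and is not shown here.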

Approach 3/ seems like it might be the most complex from an
implementation/code point of view.

I don't think #3 is all that complex to implement beyond the other options, except if you want to search across all languages - but the MultiSearcher can handle that.
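
As a rough illustration of option 3/, the sketch below keeps one small index per language and adds a method that searches all of them. It is again a toy in plain Java and not the Lucene API: PerLanguageIndexes and its methods are hypothetical names, and searchAll only stands in for the role MultiSearcher would play over real indices. It simply concatenates hits and makes no attempt at the score-based ranking a real searcher would do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model (hypothetical, NOT the Lucene API): one term -> document-name
// index per language, as in option 3/.
class PerLanguageIndexes {
    // language code -> (term -> names of the documents containing it)
    private final Map<String, Map<String, List<String>>> indexes = new LinkedHashMap<>();

    // Add a document to the index of its language, creating that index
    // on first use. Tokenization is a naive lowercase split.
    void add(String language, String name, String text) {
        Map<String, List<String>> index =
                indexes.computeIfAbsent(language, k -> new HashMap<>());
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<String> names = index.computeIfAbsent(term, k -> new ArrayList<>());
            if (!names.contains(name)) {
                names.add(name);
            }
        }
    }

    // Search the index of a single language; unknown languages give no hits.
    List<String> search(String language, String term) {
        Map<String, List<String>> index = indexes.get(language);
        if (index == null) {
            return new ArrayList<>();
        }
        List<String> names = index.get(term.toLowerCase(Locale.ROOT));
        return names == null ? new ArrayList<>() : new ArrayList<>(names);
    }

    // Search every language's index and concatenate the hits, visiting
    // languages in the order their indexes were first created.
    List<String> searchAll(String term) {
        List<String> hits = new ArrayList<>();
        for (String language : indexes.keySet()) {
            hits.addAll(search(language, term));
        }
        return hits;
    }
}
```

The single-language path touches only one index, while the cross-language path is a loop over all of them, which is the extra piece of work option 3/ adds. In a real setup each per-language index would presumably be built with its own analyzer; the toy uses one tokenizer for everything.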

Does anyone have any thoughts or recommendations on this?

It's tough to give a general recommendation - it really depends on how each of these solutions fits into the architecture and what needs you have in terms of querying across multiple languages and such.


    Erik


