Hi,

Interesting topic. I have thought about this as well. I want to index Chinese text that contains English, i.e., I want to treat the English words embedded in Chinese text as English tokens rather than as Chinese character tokens.
Right now I think I may have to write a special analyzer that reads the input text and checks whether each character is an ASCII character. If it is, the analyzer assembles consecutive ASCII characters into a single token; if not, it emits the character as a Chinese word token. So the bottom line is a single analyzer for all the text, with the if/else logic inside the analyzer.

I would like to hear more thoughts about this!

Thanks,

Jian

On 5/31/05, Tansley, Robert <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> DSpace (www.dspace.org) currently uses Lucene to index metadata
> (Dublin Core standard) and extracted full-text content of documents
> stored in it. Now that the system is being used globally, it needs to
> support multi-language indexing.
>
> I've looked through the mailing list archives etc. and it seems it's
> easy to plug in analyzers for different languages.
>
> What if we're trying to index multiple languages in the same site? Is
> it best to have:
>
> 1/ one index for all languages
> 2/ one index for all languages, with an extra language field so searches
> can be constrained to a particular language
> 3/ separate indices for each language?
>
> I don't fully understand the consequences in terms of performance for
> 1/, but I can see that false hits could turn up where one word appears
> in different languages (stemming could increase the chances of this).
> Also some languages' analyzers are quite dramatically different (e.g.
> the Chinese one which just treats every character as a separate
> token/word).
>
> On the other hand, if people are searching for proper nouns in metadata
> (e.g. "DSpace") it may be advantageous to search all languages at once.
>
>
> I'm also not sure of the storage and performance consequences of 2/.
>
> Approach 3/ seems like it might be the most complex from an
> implementation/code point of view.
>
> Does anyone have any thoughts or recommendations on this?
>
> Many thanks,
>
> Robert Tansley / Digital Media Systems Programme / HP Labs
> http://www.hpl.hp.com/personal/Robert_Tansley/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]