Hi Erik, I am a newcomer to this list, so please allow me to ask a basic question.
Will the StandardAnalyzer have to be modified to accept different character encodings? We have customers in China, Taiwan and Hong Kong, so Chinese data may arrive in three different encodings: Big5, GB and UTF-8. What is the default encoding for the StandardAnalyzer?

By the way, I tried running the Lucene demo (web template) to index the HTML files after adding one that contained both English and Chinese characters. I was not able to search for any Chinese in that HTML file (it returned no hits), although searching for an English word that appeared in the same file did find it. I wonder whether I need to change some of the Java programs to index Chinese and/or accept Chinese as a search term.

Thanks,
Bob

On May 31, 2005, Erik wrote:

Jian - have you tried Lucene's StandardAnalyzer with Chinese? It will keep English as-is (removing stop words, lowercasing, and such) and also separate CJK characters into individual tokens.

Erik

On May 31, 2005, at 5:49 PM, jian chen wrote:

> Hi,
>
> Interesting topic. I have thought about this as well. I want to index
> Chinese text mixed with English, i.e., I want to treat the English text
> embedded in Chinese text as English tokens rather than as Chinese tokens.
>
> Right now I think I may have to write a special analyzer that takes the
> text input and detects whether each character is an ASCII character. If
> it is, it assembles the whole ASCII run into a single token; if not, it
> emits the character as a Chinese word token.
>
> So, bottom line: use just one analyzer for all the text and do the
> if/else inside the analyzer.
>
> I would like to hear more thoughts about this!
>
> Thanks,
>
> Jian
>
> On 5/31/05, Tansley, Robert <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>
>> DSpace (www.dspace.org) currently uses Lucene to index metadata
>> (Dublin Core standard) and the extracted full-text content of the
>> documents stored in it. Now that the system is being used globally,
>> it needs to support multi-language indexing.
>>
>> I've looked through the mailing list archives etc., and it seems easy
>> to plug in analyzers for different languages.
>>
>> But what if we're trying to index multiple languages on the same site?
>> Is it best to have:
>>
>> 1/ one index for all languages
>> 2/ one index for all languages, with an extra language field so
>>    searches can be constrained to a particular language
>> 3/ separate indices for each language?
>>
>> I don't fully understand the performance consequences of 1/, but I can
>> see that false hits could turn up where one word appears in different
>> languages (stemming could increase the chances of this). Also, some
>> languages' analyzers are dramatically different (e.g. the Chinese one,
>> which just treats every character as a separate token/word).
>>
>> On the other hand, if people are searching for proper nouns in
>> metadata (e.g. "DSpace"), it may be advantageous to search all
>> languages at once.
>>
>> I'm also not sure of the storage and performance consequences of 2/.
>>
>> Approach 3/ seems like it might be the most complex from an
>> implementation/code point of view.
>>
>> Does anyone have any thoughts or recommendations on this?
>>
>> Many thanks,
>>
>> Robert Tansley / Digital Media Systems Programme / HP Labs
>> http://www.hpl.hp.com/personal/Robert_Tansley/
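
On Bob's encoding question: Lucene itself never sees Big5, GB or UTF-8 bytes. Analyzers, including StandardAnalyzer, operate on java.lang.String and java.io.Reader data, which is already Unicode, so there is no "default encoding" to change inside the analyzer. The decoding has to happen when the bytes are read in, and that is also the most likely reason the demo found no Chinese: the HTML was probably read using the platform default encoding. A minimal sketch, with file names and encodings made up for illustration:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class DecodeExample {
        // Decode a file's bytes with an explicit charset name ("Big5",
        // "GB2312" or "UTF-8") so Lucene receives correct Unicode text.
        public static Reader openReader(String path, String encoding)
                throws Exception {
            return new BufferedReader(
                    new InputStreamReader(new FileInputStream(path), encoding));
        }

        public static void main(String[] args) throws Exception {
            Reader big5 = openReader("doc-big5.txt", "Big5");
            Reader gb   = openReader("doc-gb.txt", "GB2312");
            Reader utf8 = openReader("doc-utf8.txt", "UTF-8");
            // Any of these Readers can be handed to a Field or to
            // Analyzer.tokenStream(...) without touching the analyzer.
            big5.close(); gb.close(); utf8.close();
        }
    }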
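
To check Erik's description of StandardAnalyzer directly, it is easy to print the tokens it produces. This sketch assumes the Lucene 1.x TokenStream API of the time, where next() returns one Token per call and null at the end:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            String text = "Lucene \u7d22\u5f15 testing";  // mixed English/CJK
            TokenStream stream = new StandardAnalyzer()
                    .tokenStream("contents", new StringReader(text));
            for (Token t = stream.next(); t != null; t = stream.next()) {
                System.out.println(t.termText());
            }
            stream.close();
        }
    }

If StandardAnalyzer behaves as Erik says, this prints "lucene", then one token per CJK character, then "testing".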
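
And a rough cut of the tokenizer Jian describes, again against the Lucene 1.x API (Tokenizer exposes a protected input Reader and subclasses implement next()). The class name is made up, punctuation handling is deliberately naive, and a real version would need more care:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;

    // Runs of ASCII characters become single lowercased tokens; every
    // other non-whitespace character (assumed CJK) becomes its own token.
    public class AsciiCjkTokenizer extends Tokenizer {
        private int offset = 0;     // characters consumed so far
        private int pushback = -1;  // one character of lookahead

        public AsciiCjkTokenizer(Reader in) { super(in); }

        private int read() throws IOException {
            if (pushback != -1) { int c = pushback; pushback = -1; return c; }
            return input.read();
        }

        public Token next() throws IOException {
            int c;
            do {  // skip whitespace between tokens
                c = read();
                if (c == -1) return null;
                offset++;
            } while (Character.isWhitespace((char) c));

            int start = offset - 1;
            if (c < 128) {  // ASCII: collect the whole run as one token
                StringBuffer sb = new StringBuffer();
                while (c != -1 && c < 128 && !Character.isWhitespace((char) c)) {
                    sb.append(Character.toLowerCase((char) c));
                    c = read();
                    offset++;
                }
                if (c != -1) { pushback = c; offset--; }
                return new Token(sb.toString(), start, start + sb.length());
            }
            // non-ASCII: emit the single character as its own token
            return new Token(String.valueOf((char) c), start, start + 1);
        }
    }

A matching Analyzer would simply return new AsciiCjkTokenizer(reader) from its tokenStream(fieldName, reader) method.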
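
Finally, on Robert's option 2/: the extra language field costs one indexed, untokenized term per document, and constraining a search to one language is just a required term in a BooleanQuery. A sketch against the Lucene 1.4 API current at the time (field and language names are made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class LanguageFieldExample {
        // Indexing: tag each document with an untokenized language keyword.
        static Document makeDoc(String body, String lang) {
            Document doc = new Document();
            doc.add(Field.Text("contents", body));     // analyzed and indexed
            doc.add(Field.Keyword("language", lang));  // kept as-is, e.g. "zh"
            return doc;
        }

        // Searching: require the language term alongside the user's query.
        static Query constrainToLanguage(Query userQuery, String lang) {
            BooleanQuery combined = new BooleanQuery();
            combined.add(userQuery, true, false);  // required, not prohibited
            combined.add(new TermQuery(new Term("language", lang)), true, false);
            return combined;
        }
    }

Leaving the language clause off searches all languages at once, which covers the proper-noun case ("DSpace") Robert mentions.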