On Friday 03 Jun 2005 01:06, Bob Cheung wrote:
> For the StandardAnalyzer, will it have to be modified to accept
> different character encodings?
>
> We have customers in China, Taiwan and Hong Kong.  Chinese data may come
> in three different encodings: Big5, GB and UTF-8.
>
> What is the default encoding for the StandardAnalyzer?

The analysers themselves do not worry about encodings, per se. Java uses 
Unicode strings throughout, which can represent all languages. When reading 
in text files, it's a matter of telling the reader which encoding the file 
is in, so that Java can decode the text and essentially map that encoding 
onto Unicode. All the string operations, such as analysing, are then done on 
these Unicode strings.

So the task is to make sure the file reader you use to open a document for 
indexing is given the information it needs to decode your file correctly. 
If you don't specify an encoding, Java will pick one based on your OS 
locale; for me that's Latin1, as I'm in Britain. That is clearly inadequate 
for non-Latin texts and would not read Chinese text properly, since Latin1 
doesn't cover those characters. You need to specify Big5 (or whichever 
encoding the file actually uses) yourself. See the documentation for 
InputStreamReader:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStreamReader.html
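
For example, a minimal sketch of reading a Big5 file that way (the file 
name here is just an illustration; for GB or UTF-8 data you would pass that 
encoding name instead):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class ReadBig5 {
    public static void main(String[] args) throws Exception {
        // The second argument tells Java how the bytes on disk are encoded;
        // the reader decodes them into Unicode characters, which is what
        // the analysers then operate on.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("article.txt"), "Big5"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // already Unicode at this point
        }
        in.close();
    }
}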

Andy

>
> Btw, I did try running the Lucene demo (web template) to index the HTML
> files after adding one that contains both English and Chinese characters.
> I was not able to search for any Chinese text in that HTML file (the
> search returned no hits).  I wonder whether I need to change some of the
> Java programs to index Chinese and/or accept Chinese as a search term.  I
> was able to find the HTML file when I searched for an English word that
> appeared in it.
>
> Thanks,
>
> Bob
>
>
> On May 31, 2005, Erik wrote:
>
> Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It
> handles English in the usual way (removing stop words, lowercasing,
> and such) and also splits CJK text into a separate token per character.
>
>      Erik
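
A quick way to see what Erik describes, sketched against the Lucene 1.4-era 
TokenStream API (TokenStream.next() / Token.termText()); the field name and 
sample text here are made up:

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // Mixed English/Chinese sample: the English word comes out as a
        // single (lowercased) token, while each Chinese character should
        // come out as a token of its own, per the behaviour described above.
        TokenStream stream = analyzer.tokenStream(
                "contents", new StringReader("Lucene 搜索引擎"));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText() + " [" + token.type() + "]");
        }
    }
}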
>
> On May 31, 2005, at 5:49 PM, jian chen wrote:
> > Hi,
> >
> > Interesting topic. I have thought about this as well. I want to index
> > Chinese text mixed with English, i.e., to treat the English text that
> > appears inside Chinese text as English tokens rather than Chinese tokens.
> >
> > Right now I think I may have to write a special analyzer that takes the
> > text input and checks whether each character is ASCII: if it is, it
> > assembles the run of ASCII characters into one token; if not, it emits
> > the Chinese character as a word token of its own.
> >
> > So, the bottom line is one analyzer for all the text, with the if/else
> > logic inside the analyzer.
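
A rough sketch of that if/else idea in plain Java (not a complete Lucene 
Tokenizer, and the ASCII test is deliberately simplistic; the class and 
method names are made up):

import java.util.ArrayList;
import java.util.List;

public class MixedTokenizerSketch {
    // Groups runs of ASCII letters/digits into single tokens and emits every
    // other non-whitespace character (e.g. a Chinese character) as a token
    // of its own.
    public static List tokenize(String text) {
        List tokens = new ArrayList();
        StringBuffer ascii = new StringBuffer();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128 && Character.isLetterOrDigit(c)) {
                ascii.append(c);                   // keep building the English token
            } else {
                if (ascii.length() > 0) {          // flush any pending English token
                    tokens.add(ascii.toString());
                    ascii.setLength(0);
                }
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c)); // one token per Chinese character
                }
            }
        }
        if (ascii.length() > 0) {
            tokens.add(ascii.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Lucene, 是, 一, 个, search, engine]
        System.out.println(tokenize("Lucene是一个search engine"));
    }
}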
> >
> > I would like to learn more thoughts about this!
> >
> > Thanks,
> >
> > Jian
> >
> > On 5/31/05, Tansley, Robert <[EMAIL PROTECTED]> wrote:
> >> Hi all,
> >>
> >> DSpace (www.dspace.org) currently uses Lucene to index metadata
> >> (Dublin Core standard) and the extracted full-text content of the
> >> documents stored in it.  Now that the system is being used globally,
> >> it needs to support multi-language indexing.
> >>
> >> I've looked through the mailing list archives etc. and it seems it's
> >> easy to plug in analyzers for different languages.
> >>
> >> What if we're trying to index multiple languages in the same site?
> >> Is it best to have:
> >>
> >> 1/ one index for all languages
> >> 2/ one index for all languages, with an extra language field so
> >>    searches can be constrained to a particular language
> >> 3/ separate indices for each language?
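
For option 2/, a minimal sketch of what the extra language field might look 
like, using the Lucene 1.4-era Field API (the field names are made up); the 
per-language constraint would then be a clause such as +language:zh combined 
with the normal content query:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class LanguageFieldSketch {
    // Builds a document that carries an untokenized "language" field next to
    // the analysed content, so searches can be restricted to one language.
    public static Document makeDoc(String language, String title, String body) {
        Document doc = new Document();
        doc.add(Field.Keyword("language", language)); // stored, not analysed
        doc.add(Field.Text("title", title));          // stored and analysed
        doc.add(Field.Text("contents", body));        // stored and analysed
        return doc;
    }
}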
> >>
> >> I don't fully understand the consequences of 1/ in terms of
> >> performance, but I can see that false hits could turn up where the
> >> same word appears in different languages (stemming could increase the
> >> chances of this).  Also, some languages' analyzers are quite
> >> dramatically different (e.g. the Chinese one, which just treats every
> >> character as a separate token/word).
> >>
> >> On the other hand, if people are searching for proper nouns in
> >> metadata (e.g. "DSpace"), it may be advantageous to search all
> >> languages at once.
> >>
> >>
> >> I'm also not sure of the storage and performance consequences of 2/.
> >>
> >> Approach 3/ seems like it might be the most complex from an
> >> implementation/code point of view.
> >>
> >> Does anyone have any thoughts or recommendations on this?
> >>
> >> Many thanks,
> >>
> >>  Robert Tansley / Digital Media Systems Programme / HP Labs
> >>   http://www.hpl.hp.com/personal/Robert_Tansley/
> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
