Hi everyone,
We are currently using Lucene to index correspondence between various
people, who may or may not use the same language in their discussions
with each other. Think of an email system where participants use
whichever language seems most appropriate to the thought at the time,
just as they would in conversation.
An example (CN = some Chinese text. Use your imagination!):
From: Someone in the UK
To: Someone in China
Subject: Re: CNCNCNCNCNCNCNCNCNCNCN
> CNCNCNCNCNCNCNCN
Yes, I think that's fine. I'm OK with that as long as Bob is.
> CNCNCNCNCNCN
CNCN?
> Tuesday OK?
I need it by Monday, sorry. CNCN!
We need to index that, and be able to search on it -- for both the
Chinese and English text. Note that stemming is not a particular need of
ours -- we're happy to search for literal tokens, but of course that may
not apply to other languages where stemming is expected behaviour, not
just a 'nicety'.
Anyway: so far, fine -- StandardAnalyzer is perfectly suited to our
needs. The problem is that the next language out of the starting blocks
is Arabic, which StandardAnalyzer doesn't seem to be up to.
I've looked into previous discussions about this on the various lists,
and it seems to me there are a few options:
1) Maintain multiple indexes (StandardAnalyzer-analyzed,
ArabicAnalyzer-analyzed, LanguageXXXAnalyzer-analyzed) and search across
all of them, merging results (a sketch of this follows the list)
2) Maintain multiple indexes, ask the user which one to use at search-time:
Search for the [Arabic \/] text: [______________________]
3) Use StandardAnalyzer and hope for the best.
4) Write a new... "Super Analyzer" that tries to deal with this. This is
POSSIBLY the best idea -- and, of course, almost certainly the hardest!
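For reference, a minimal sketch of option 1 -- the index paths and
field names here are made up, but MultiSearcher, IndexSearcher, Hits
etc. are stock Lucene classes:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.MultiSearcher;
  import org.apache.lucene.search.Searchable;
  import org.apache.lucene.search.TermQuery;

  // One sub-searcher per per-language index; MultiSearcher merges the
  // hits from all of them into one ranked list. (IOExceptions omitted
  // for brevity.)
  Searchable[] indexes = new Searchable[] {
      new IndexSearcher("/indexes/standard"),  // built with StandardAnalyzer
      new IndexSearcher("/indexes/arabic")     // built with ArabicAnalyzer
  };
  MultiSearcher searcher = new MultiSearcher(indexes);
  Hits hits = searcher.search(new TermQuery(new Term("body", "monday")));
  for (int i = 0; i < hits.length(); i++) {
      System.out.println(hits.doc(i).get("subject"));
  }
  searcher.close();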
Basically, what we're considering is writing some sort of new
CompositeAnalyzer class which applies the following algorithm (in very
simple terms):
a) Start reading the stream
b) Look at the next character
c) Use some sort of Character.UnicodeBlock (or, more generally,
Character.Subset) -> Analyzer mapping to work out which Analyzer we want
to use -- e.g. on finding a character in Character.UnicodeBlock.GREEK,
load a GreekAnalyzer.
d) Keep reading until we hit something that makes us think we need to
change analyzers (either end-of-stream or something incongruous -- e.g.
something from Character.UnicodeBlock.CYRILLIC). Then bundle up what
we've got, hand it to the GreekAnalyzer, and start the process again
with a RussianAnalyzer (or whatever). A rough sketch of this loop
follows.
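To make that concrete, here's a rough sketch of the segmentation pass.
handOff(), analyzerFor(), blockToAnalyzer and defaultAnalyzer are all
invented names for the hypothetical CompositeAnalyzer internals;
Character.UnicodeBlock.of() is standard java.lang. One wrinkle the
sketch glosses over: spaces, digits and punctuation all live in
BASIC_LATIN, so a real implementation would have to treat such
'neutral' characters as part of the current run rather than forcing an
analyzer switch at every space.

  import java.io.IOException;
  import java.io.Reader;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.analysis.Analyzer;

  public class CompositeAnalyzerSketch {
      // Character.UnicodeBlock -> Analyzer; populated by the user or
      // from default mappings.
      private final Map blockToAnalyzer = new HashMap();
      private Analyzer defaultAnalyzer;  // e.g. StandardAnalyzer

      void segment(Reader in) throws IOException {
          StringBuffer run = new StringBuffer();
          Character.UnicodeBlock current = null;
          int c;
          while ((c = in.read()) != -1) {
              Character.UnicodeBlock block =
                  Character.UnicodeBlock.of((char) c);
              if (current != null && block != current) {
                  // Block changed: hand the finished run to its
                  // analyzer and start accumulating a new one.
                  handOff(run.toString(), analyzerFor(current));
                  run.setLength(0);
              }
              current = block;
              run.append((char) c);
          }
          if (run.length() > 0) {
              handOff(run.toString(), analyzerFor(current));
          }
      }

      private Analyzer analyzerFor(Character.UnicodeBlock block) {
          Analyzer a = (Analyzer) blockToAnalyzer.get(block);
          return (a != null) ? a : defaultAnalyzer;
      }

      private void handOff(String run, Analyzer analyzer) {
          // Feed 'run' to analyzer.tokenStream(...) and splice the
          // resulting tokens into the composite stream -- omitted.
      }
  }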
Obviously the best way to do this would be to have these mappings
dynamic, not set in stone -- some people might like all of
CJK_COMPATIBILITY to be handed to the CJKAnalyzer, some to the
ChineseAnalyzer, some might like to use their own, etc. Of course
there's no reason default mappings can't be supplied.
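Usage might then look something like this -- setAnalyzer() is an
invented method name, and CJKAnalyzer etc. are the contrib analyzers:

  CompositeAnalyzer analyzer = new CompositeAnalyzer();
  analyzer.setAnalyzer(Character.UnicodeBlock.CJK_COMPATIBILITY,
                       new CJKAnalyzer());
  analyzer.setAnalyzer(Character.UnicodeBlock.ARABIC, new ArabicAnalyzer());
  analyzer.setAnalyzer(Character.UnicodeBlock.GREEK, new GreekAnalyzer());
  // Anything unmapped falls through to the default (StandardAnalyzer).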
I guess the basic question is -- what does everyone think? Is this
useful and workable, or are there any fatal flaws in it? Obviously the
biggie is that sometimes Unicode ranges are not sufficient to determine
which analyzer to use -- for example, we may want to use the
GermanAnalyzer specifically for German text, but German is basically
impossible to tell apart from English purely by the Unicode block of
the next character, since both use Latin script. At least this way we'd
have the OPTION of farming text off to more specific Analyzers based on
character set. Being able to tell Urdu from Arabic within a shared
script is something of a separate issue, but the "CompositeAnalyzer"
would bring us a bit closer to the goal. It may be rudimentary, but I
think the 'pluggable' architecture could be useful -- certainly more
useful in our case than just running StandardAnalyzer over everything.
If this project goes ahead, it's possible (even likely) that it would be
contributed back to the Lucene sandbox. As such, I'm very interested to
hear about any suggestions, criticisms, or other feedback you might have.
Cheers,
Paul Cowan