As far as your usage is concerned, it seems to be the right approach, and I think the StandardAnalyzer does the job pretty well whatever language it has to deal with. Note, however, that it only handles English stop words unless you specify others at index time. And if you do change the stop words at index time, you must use the very same ones at query time, otherwise some queries won't work well.
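For example, something like this (a rough sketch against the Lucene 1.4-era API that is current as I write this; the French stop-word list is made up for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class StopWordExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical stop-word list; StandardAnalyzer only ships English ones.
        String[] frenchStopWords = { "le", "la", "les", "de", "des", "et", "un", "une" };
        StandardAnalyzer analyzer = new StandardAnalyzer(frenchStopWords);

        // Index time: the custom stop words are stripped here...
        IndexWriter writer = new IndexWriter("/tmp/index-fr", analyzer, true);
        // ... add documents ...
        writer.close();

        // ...so the SAME analyzer must be used at query time, or a stopped
        // word would survive in the query and never match anything.
        Query query = new QueryParser("contents", analyzer).parse("le chat noir");
        System.out.println(query); // "le" has been removed by the analyzer
    }
}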

But in my case, each content (content in the sense of a CMS) is known to exist in multiple languages, and each of these languages *can* be indexed separately with no problem at all, so a dedicated analyzer could be used for each. So I was wondering whether my approach could be the right one, or whether it is overly complex and could introduce some problem I cannot see... (My approach being: one index per language.)
Advantages are:
- You always have the same analyzer for one index, so if you want to benefit from some language-specific indexing capabilities (stemmer, filter, whatever), you can! (See the indexing sketch right after this list.)
- Should you need to search in all the languages, you just need to run the query on every single index, and you still benefit from each analyzer.
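Here is a rough sketch of what I mean (again the 1.4-era API; the index paths, and GermanAnalyzer as the example of a dedicated analyzer, are just illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PerLanguageIndexer {
    // One dedicated analyzer per language; a given index only ever
    // sees its own analyzer.
    private final Map analyzers = new HashMap();

    public PerLanguageIndexer() {
        analyzers.put("en", new StandardAnalyzer());
        analyzers.put("de", new GermanAnalyzer()); // German stemming + stop words
        // ... one entry per supported language ...
    }

    public void index(String lang, String id, String text) throws Exception {
        Analyzer analyzer = (Analyzer) analyzers.get(lang);
        // Hypothetical convention: one directory per language.
        // (true = create; real code would open the writer once and reuse it)
        IndexWriter writer = new IndexWriter("/indices/" + lang, analyzer, true);
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));      // stored, not analyzed
        doc.add(Field.Text("contents", text)); // analyzed per language
        writer.addDocument(doc);
        writer.close();
    }
}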
Disadvantages are:
- You have to deal with as many indices as you have languages. Then again, if you only search in one language, this becomes a performance advantage, I think.
- You have to merge results coming from different indices, which is a problem when dealing with scores; any suggestions? (See the search sketch right after this list.)
- Unless I'm wrong, you cannot use a MultiSearcher, because only one analyzer can be specified for the query, not one analyzer per searcher (someone please correct me if I'm wrong).
- Others?
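On the search side, here is a possible workaround for that MultiSearcher limitation (a sketch only, still assuming the 1.4-era API): parse the query string once per language, each time with the analyzer of that language's index, OR the parsed queries together, and run the combined query through a MultiSearcher. Since MultiSearcher aggregates document frequencies across its sub-searchers, the merged scores should at least stay roughly comparable:

import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class CrossLanguageSearch {
    public static void main(String[] args) throws Exception {
        // One searcher per language index (hypothetical paths).
        IndexSearcher en = new IndexSearcher("/indices/en");
        IndexSearcher de = new IndexSearcher("/indices/de");

        // Parse the same user input once per language, with the analyzer
        // that was used to build that language's index.
        String userInput = "houses";
        Query qEn = new QueryParser("contents", new StandardAnalyzer()).parse(userInput);
        Query qDe = new QueryParser("contents", new GermanAnalyzer()).parse(userInput);

        // OR them together (1.4 signature: add(query, required, prohibited)).
        BooleanQuery combined = new BooleanQuery();
        combined.add(qEn, false, false);
        combined.add(qDe, false, false);

        MultiSearcher searcher = new MultiSearcher(new Searchable[] { en, de });
        Hits hits = searcher.search(combined);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + "  " + hits.doc(i).get("id"));
        }
        searcher.close();
    }
}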

I don't know if the Lucene developers would agree, but from what I've been browsing in the mailing list archives, these multilingual issues seem to arise quite often, and some articles like "best practices", "do's and don'ts" or "Lucene architecture in a multilingual environment" would be really nice to see :) If some of you have the time and the experience to write them, I'd be really thankful! :)

Olivier

Hacking Bear wrote:

Hi,
I have a similar problem to deal with. A lot of the time, documents do not carry any language information, or they contain text in multiple languages. Furthermore, users would rather not always supply this information, and they may very well be interested in documents in multiple languages. I think Google and the other search engines index multi-language documents: for example, if you google "Java", many of the matching documents are in languages other than English. The only assumption we can make is that the document text is converted to Unicode before being fed to Lucene.

So I think the solution should be: (1) create one index for all languages; (2) add an advisory attribute like "lang" to specify the language of the document, and if the language is unknown, just leave it empty or set it to "ANY"; (3) based on the Unicode ranges of the incoming characters, automatically switch among different analyzers to index the fragments of the text; (4) during search, unless the user explicitly requests documents in a certain language, return all matches regardless of language. I have browsed through the Lucene and contributed source code, but I cannot tell which analyzer is suitable for step (3). While the logic for such an analyzer is probably not too complicated, it seems to demand quite some Unicode knowledge to create one.
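To make (3) concrete, here is the crudest sketch I can imagine of such switching, using only java.lang.Character plus analyzers found in the distribution and the sandbox (note it cannot tell apart languages sharing a script, e.g. English from French, so real language detection would still be needed):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;    // from the sandbox/contrib
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ScriptSniffer {
    /** Pick an analyzer from the first recognizable Unicode block of a fragment. */
    public static Analyzer analyzerFor(String fragment) {
        for (int i = 0; i < fragment.length(); i++) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(fragment.charAt(i));
            if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || block == Character.UnicodeBlock.HIRAGANA
                    || block == Character.UnicodeBlock.KATAKANA) {
                return new CJKAnalyzer();
            }
            if (block == Character.UnicodeBlock.CYRILLIC) {
                return new RussianAnalyzer();
            }
        }
        // Latin script: impossible to tell English from French this way.
        return new StandardAnalyzer();
    }
}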
Is my approach the right one? Is there a suitable analyzer to use?
Thanks.
- HB

On 9/5/05, Olivier Jaquemet <[EMAIL PROTECTED]> wrote:
Hi,

I'd like to go into detail regarding the issues that occur when you want
to index and search content in multiple languages.

I have read the Lucene in Action book and many threads on this mailing
list, the most interesting so far being this one:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/[EMAIL PROTECTED]

The solution chosen/recommended by Doug Cutting in this message:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200506.mbox/[EMAIL PROTECTED]
is number '2/': having one index for all languages, one Document per
language of each content, with a field specifying its language, and
using a query filter when searching.
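(If I understand it correctly, it would look roughly like this with the
1.4 API; the field and path names are just examples:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class SingleIndexLangFilter {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index time: one Document per language version of a content,
        // tagged with an untokenized "lang" keyword field.
        IndexWriter writer = new IndexWriter("/tmp/all-langs", analyzer, true);
        Document doc = new Document();
        doc.add(Field.Keyword("lang", "fr"));
        doc.add(Field.Text("contents", "bonjour tout le monde"));
        writer.addDocument(doc);
        writer.close();

        // Query time: a QueryFilter restricts the search to one language.
        IndexSearcher searcher = new IndexSearcher("/tmp/all-langs");
        Query query = new QueryParser("contents", analyzer).parse("bonjour");
        QueryFilter frenchOnly = new QueryFilter(new TermQuery(new Term("lang", "fr")));
        Hits hits = searcher.search(query, frenchOnly);
        System.out.println(hits.length() + " French hit(s)");
        searcher.close();
    }
}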

While I think it is a good solution:
- If you have N languages and you search for something in one language,
you are going to search an index N times too large. Wouldn't it be
better to have N indices for N languages? That way, each index could
benefit from its specialized analyzer, and if you need to search in
multiple languages, you just need to merge the results from those
different indices.
- If you have contents in multiple languages like we do, and by that I
don't mean multiple contents each with its own language, but multiple
contents each existing in many languages, you are going to have an
N-to-1 Document/content relation in the index. As far as updates,
deletes, and searches in multiple languages are concerned, wouldn't it
be simpler to always keep a 1-to-1 Document/content relation in an
index?

As you may have guessed, my original thought, even before I read those
threads, was that solution number 3 might be more flexible/modular than
the others. Of course it also has its drawbacks:
- performance issues when doing a multiple-language search, especially
when merging the results of different indices
- more complex code
- others?

Can you clarify this?
What solutions have all of you chosen so far for indexing and searching
multiple contents in multiple languages?

Thanks!

Olivier



--
Olivier Jaquemet <[EMAIL PROTECTED]>
Ingénieur R&D Jalios S.A.
Tel: 01.39.23.92.83
http://www.jalios.com/
http://support.jalios.com/



