Hi,
 I have a similar problem to deal with. In fact, a lot of the time the 
documents do not carry any language information, or they contain text in 
multiple languages. Further, the user would not always want to supply this 
information. Also, the user may very well be interested in documents in 
multiple languages.
 I think Google and other search engines allow indexing multi-language 
documents. For example, if you google "Java", there will be many matching 
documents in languages other than English.
 The only assumption we can make is that the document text is converted to 
Unicode before being fed to Lucene.

So I think the solution should be: (1) create one index for all languages; 
(2) add an advisory attribute like "lang" to specify the language of the 
document; if the language is unknown, just leave it empty or set it to "ANY"; 
(3) based on the Unicode blocks of the incoming characters, we automatically 
switch among different analyzers to index the fragments of the text; 
(4) during search, unless the user explicitly requests documents in a 
certain language, we return all matches regardless of language.
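 To make (2) and (4) concrete, here is a rough sketch of what I have in 
mind, written against the Lucene 1.4 API. Please treat it only as an 
illustration: StandardAnalyzer stands in for whatever analyzer choice (3) 
ends up making, and the index path and field names are just placeholders.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;

    public class MultiLangSketch {

        // (1)+(2): one index for all languages; "lang" is only advisory.
        static void index(String indexDir, String text, String lang) throws Exception {
            // false = append to an existing index
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(Field.Keyword("lang", lang == null ? "ANY" : lang)); // stored, not analyzed
            doc.add(Field.Text("contents", text));                       // stored, analyzed
            writer.addDocument(doc);
            writer.close();
        }

        // (4): return everything unless the user explicitly restricts the language.
        static Hits search(String indexDir, String userQuery, String lang) throws Exception {
            IndexSearcher searcher = new IndexSearcher(indexDir);
            Query query = new QueryParser("contents", new StandardAnalyzer()).parse(userQuery);
            if (lang == null) {
                return searcher.search(query);                 // all languages
            }
            Filter langFilter = new QueryFilter(new TermQuery(new Term("lang", lang)));
            return searcher.search(query, langFilter);         // only the requested language
        }
    }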
 I have browsed through the Lucene core and contrib source code, but I 
cannot tell which analyzer is suitable for use in (3). While the logic for 
such an analyzer is probably not too complicated, it seems to demand quite 
some Unicode knowledge to create one.
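 For (3), the core of what I imagine is just a skeleton like the one below, 
using java.lang.Character.UnicodeBlock to look at the incoming characters and 
the contrib CJKAnalyzer for the CJK blocks. The block-to-analyzer table (and 
splitting the text into fragments at block boundaries) is exactly the part I 
am unsure about.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;           // from contrib
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class BlockBasedAnalyzerChooser {

        // Pick an analyzer for a text fragment by inspecting the Unicode
        // blocks of its characters. Only a skeleton: a real version would
        // need a much more complete block-to-analyzer table.
        static Analyzer choose(String fragment) {
            for (int i = 0; i < fragment.length(); i++) {
                Character.UnicodeBlock block = Character.UnicodeBlock.of(fragment.charAt(i));
                if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                        || block == Character.UnicodeBlock.HIRAGANA
                        || block == Character.UnicodeBlock.KATAKANA
                        || block == Character.UnicodeBlock.HANGUL_SYLLABLES) {
                    return new CJKAnalyzer();
                }
            }
            return new StandardAnalyzer();  // default for Latin and anything unrecognised
        }
    }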
 Is my approach the right one? Is there an existing analyzer I could use?
 Thanks.
- HB

 On 9/5/05, Olivier Jaquemet <[EMAIL PROTECTED]> wrote: 
> 
> Hi,
> 
> I'd like to go into detail regarding the issues that occur when you want to
> index and search content in multiple languages.
> 
> I have read the Lucene in Action book, and many threads on this mailing list,
> the most interesting so far being this one:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/[EMAIL PROTECTED]
> 
> The solution chosen/recommended by Doug Cutting in this message:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200506.mbox/[EMAIL PROTECTED]
> is number '2/':
> Having one index for all languages, one Document per content's language,
> with a field specifying its language, and using a query filter when
> searching.
> 
> While I think it is a good solution:
> - If you have N languages and you search for something in 1 language,
> you are going to search an index N times too large.
> Wouldn't it be better to have N indices for N languages? That way, each
> index could benefit from its specialized analyzer, and if you need to
> search in multiple languages, you just need to merge the results from those
> different indices.
> - If you have content in multiple languages like we do, and by that I
> don't mean multiple contents each having its own language, but
> multiple contents, each one being in many languages, you are going to
> have an N-to-1 Document/content relation in the index.
> As far as update, delete, and search in multiple languages are concerned,
> wouldn't it be simpler to always keep a 1-to-1 Document/content relation
> in an index?
> 
> As you may have guessed, my original thought, even before I read those
> threads, was that solution number 3 might be more flexible/modular
> than the others; of course it also has its drawbacks:
> - performance issues when doing a multiple-language search, especially when
> merging results from different indices.
> - more complex to code
> - other?
> 
> Can you clarify this?
> What solutions have all of you chosen so far regarding the indexing and
> searching of multiple contents in multiple languages?
> 
> Thanks!
> 
> Olivier
> 
