As far as your usage is concerned, it seems to be the right approach, and I think the StandardAnalyzer does the job pretty well whatever language it has to deal with. Note, however, that it only handles English stop words unless you specify others at index time. And if you do change the stop words at index time, you must use the very same ones at query time, otherwise some queries won't work well.
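For example, something like this (a rough sketch against the Lucene 1.4-era API that is current as I write this; the French stop-word list is made up for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class StopWordExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical stop-word list; StandardAnalyzer only ships English ones.
        String[] frenchStopWords = { "le", "la", "les", "de", "des", "et", "un", "une" };
        StandardAnalyzer analyzer = new StandardAnalyzer(frenchStopWords);

        // Index time: the custom stop words are stripped here...
        IndexWriter writer = new IndexWriter("/tmp/index-fr", analyzer, true);
        // ... add documents ...
        writer.close();

        // ...so the SAME analyzer must be used at query time, or a stopped
        // word would survive in the query and never match anything.
        Query query = new QueryParser("contents", analyzer).parse("le chat noir");
        System.out.println(query); // "le" has been removed by the analyzer
    }
}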

But in my case, each content (content in the sense of a CMS) is known to exist in multiple languages, and each of these languages *can* be indexed separately with no problem at all, so a dedicated analyzer could be used for each. So I was wondering whether my approach could be the right one, or whether it is overly complex and could introduce some problem I cannot see... (My approach being: one index per language.)
Advantages are:
- You always have the same analyzer for one index, so if you want to benefit from some language-specific indexing capabilities (stemmer, filter, whatever), you can! (See the indexing sketch right after this list.)
- Should you need to search in all the languages, you just need to run the query on every single index, and you still benefit from each analyzer.
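Here is a rough sketch of what I mean (again the 1.4-era API; the index paths, and GermanAnalyzer as the example of a dedicated analyzer, are just illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PerLanguageIndexer {
    // One dedicated analyzer per language; a given index only ever
    // sees its own analyzer.
    private final Map analyzers = new HashMap();

    public PerLanguageIndexer() {
        analyzers.put("en", new StandardAnalyzer());
        analyzers.put("de", new GermanAnalyzer()); // German stemming + stop words
        // ... one entry per supported language ...
    }

    public void index(String lang, String id, String text) throws Exception {
        Analyzer analyzer = (Analyzer) analyzers.get(lang);
        // Hypothetical convention: one directory per language.
        // (true = create; real code would open the writer once and reuse it)
        IndexWriter writer = new IndexWriter("/indices/" + lang, analyzer, true);
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));      // stored, not analyzed
        doc.add(Field.Text("contents", text)); // analyzed per language
        writer.addDocument(doc);
        writer.close();
    }
}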
Disadvantages are:
- You have to deal with as many indices as you have languages. Then again, if you only search in one language, this becomes a performance advantage, I think.
- You have to merge results coming from different indices, which is a problem when dealing with scores; any suggestions? (See the search sketch right after this list.)
- Unless I'm wrong, you cannot use a MultiSearcher, because only one analyzer can be specified for the query, not one analyzer per searcher (someone please correct me if I'm wrong).
- Others?
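On the search side, here is a possible workaround for that MultiSearcher limitation (a sketch only, still assuming the 1.4-era API): parse the query string once per language, each time with the analyzer of that language's index, OR the parsed queries together, and run the combined query through a MultiSearcher. Since MultiSearcher aggregates document frequencies across its sub-searchers, the merged scores should at least stay roughly comparable:

import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class CrossLanguageSearch {
    public static void main(String[] args) throws Exception {
        // One searcher per language index (hypothetical paths).
        IndexSearcher en = new IndexSearcher("/indices/en");
        IndexSearcher de = new IndexSearcher("/indices/de");

        // Parse the same user input once per language, with the analyzer
        // that was used to build that language's index.
        String userInput = "houses";
        Query qEn = new QueryParser("contents", new StandardAnalyzer()).parse(userInput);
        Query qDe = new QueryParser("contents", new GermanAnalyzer()).parse(userInput);

        // OR them together (1.4 signature: add(query, required, prohibited)).
        BooleanQuery combined = new BooleanQuery();
        combined.add(qEn, false, false);
        combined.add(qDe, false, false);

        MultiSearcher searcher = new MultiSearcher(new Searchable[] { en, de });
        Hits hits = searcher.search(combined);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + "  " + hits.doc(i).get("id"));
        }
        searcher.close();
    }
}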

I don't know if the Lucene developers would agree, but from what I've been browsing in the mailing list archives, these multilingual issues seem to arise quite often, and some articles like "best practices", "do's and don'ts" or "Lucene architecture in a multilingual environment" would be really nice to see :) If some of you have the time and the experience to write them, I'd be really thankful! :)

Olivier

Hacking Bear wrote:

Hi,
I have a similar problem to deal with. A lot of the time, documents do not carry any language information, or they contain text in multiple languages. Furthermore, users would rather not always supply this information, and they may very well be interested in documents in multiple languages. I think Google and the other search engines index multi-language documents: for example, if you google "Java", many of the matching documents are in languages other than English. The only assumption we can make is that the document text is converted to Unicode before being fed to Lucene.

So I think the solution should be: (1) create one index for all languages; (2) add an advisory attribute like "lang" to specify the language of the document, and if the language is unknown, just leave it empty or set it to "ANY"; (3) based on the Unicode ranges of the incoming characters, automatically switch among different analyzers to index the fragments of the text; (4) during search, unless the user explicitly requests documents in a certain language, return all matches regardless of language. I have browsed through the Lucene and contributed source code, but I cannot tell which analyzer is suitable for step (3). While the logic for such an analyzer is probably not too complicated, it seems to demand quite some Unicode knowledge to create one.
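To make (3) concrete, here is the crudest sketch I can imagine of such switching, using only java.lang.Character plus analyzers found in the distribution and the sandbox (note it cannot tell apart languages sharing a script, e.g. English from French, so real language detection would still be needed):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;    // from the sandbox/contrib
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ScriptSniffer {
    /** Pick an analyzer from the first recognizable Unicode block of a fragment. */
    public static Analyzer analyzerFor(String fragment) {
        for (int i = 0; i < fragment.length(); i++) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(fragment.charAt(i));
            if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || block == Character.UnicodeBlock.HIRAGANA
                    || block == Character.UnicodeBlock.KATAKANA) {
                return new CJKAnalyzer();
            }
            if (block == Character.UnicodeBlock.CYRILLIC) {
                return new RussianAnalyzer();
            }
        }
        // Latin script: impossible to tell English from French this way.
        return new StandardAnalyzer();
    }
}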
Is my approach the right one? Is there a suitable analyzer to use?
Thanks.
- HB

On 9/5/05, Olivier Jaquemet <[EMAIL PROTECTED]> wrote:
Hi,

I'd like to go into detail regarding the issues that occur when you want
to index and search content in multiple languages.

I have read the Lucene in Action book and many threads on this mailing
list, the most interesting so far being this one:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/[EMAIL PROTECTED]

The solution chosen/recommended by Doug Cutting in this message:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200506.mbox/[EMAIL PROTECTED]
is number '2/': having one index for all languages, one Document per
language of each content, with a field specifying its language, and
using a query filter when searching.
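(If I understand it correctly, it would look roughly like this with the
1.4 API; the field and path names are just examples:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class SingleIndexLangFilter {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index time: one Document per language version of a content,
        // tagged with an untokenized "lang" keyword field.
        IndexWriter writer = new IndexWriter("/tmp/all-langs", analyzer, true);
        Document doc = new Document();
        doc.add(Field.Keyword("lang", "fr"));
        doc.add(Field.Text("contents", "bonjour tout le monde"));
        writer.addDocument(doc);
        writer.close();

        // Query time: a QueryFilter restricts the search to one language.
        IndexSearcher searcher = new IndexSearcher("/tmp/all-langs");
        Query query = new QueryParser("contents", analyzer).parse("bonjour");
        QueryFilter frenchOnly = new QueryFilter(new TermQuery(new Term("lang", "fr")));
        Hits hits = searcher.search(query, frenchOnly);
        System.out.println(hits.length() + " French hit(s)");
        searcher.close();
    }
}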

While I think it is a good solution:
- If you have N languages and you search for something in one language,
you are going to search an index N times too large. Wouldn't it be
better to have N indices for N languages? That way, each index could
benefit from its specialized analyzer, and if you need to search in
multiple languages, you just need to merge the results from those
different indices.
- If you have contents in multiple languages like we do, and by that I
don't mean multiple contents each with its own language, but multiple
contents each existing in many languages, you are going to have an
N-to-1 Document/content relation in the index. As far as updates,
deletes, and searches in multiple languages are concerned, wouldn't it
be simpler to always keep a 1-to-1 Document/content relation in an
index?

As you may have guessed, my original thought, even before I read those
threads, was that solution number 3 might be more flexible/modular than
the others. Of course it also has its drawbacks:
- performance issues when doing a multiple-language search, especially
when merging the results of different indices
- more complex code
- others?

Can you clarify this?
What solutions have all of you chosen so far for indexing and searching
multiple contents in multiple languages?

Thanks!

Olivier



--
Olivier Jaquemet <[EMAIL PROTECTED]>
Ingénieur R&D Jalios S.A.
Tel: 01.39.23.92.83
http://www.jalios.com/
http://support.jalios.com/



