Re: Multi-analyzer ?

Andy Roberts Mon, 11 Apr 2005 05:27:40 -0700

Can you not provide the user with an option list to specify their input 
language?

Language identification can be a pretty tricky field. There are some tricks 
you can do with Unicode to identify language, e.g., \u0600 - \u06FF contains 
the Arabic characters, so if your input contains lots of chars within this 
range, you can guess that the input is Arabic.
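
A rough sketch of that range check, in Java since that is what Lucene is 
written in. The class name, method names and the 0.5 threshold are all made 
up for illustration, and a real check would probably want to cover the other 
Arabic blocks as well:

```java
final class ScriptGuess {

    /** Fraction of the letters in text that fall in the range U+0600 to U+06FF. */
    static double arabicRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, spaces and punctuation
            }
            letters++;
            if (cp >= 0x0600 && cp <= 0x06FF) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    /** Guess "Arabic" when more than half of the letters are in the range (arbitrary cut-off). */
    static boolean looksArabic(String text) {
        return arabicRatio(text) > 0.5;
    }
}
```

The same idea extends to other scripts with their own blocks (Cyrillic, 
Hiragana/Katakana, and so on), but it says nothing about which language 
within a script you are looking at.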

The problem comes with differentiating between the languages that use a Latin 
alphabet. Again, there are multiple approaches, although the only one I know 
of that worked pretty well for identifying European languages was to build a 
model based on character bigrams (that is, sequences of two letters) [1].
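
To give a flavour of the bigram idea, here is a minimal sketch. It is not 
the method from [1], just a generic version: count character bigrams per 
language from some sample text, then score the input by summed log 
probabilities with add-one smoothing. The class and method names are 
hypothetical, and you would need far more training text than a sentence or 
two for this to be any use:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class BigramGuesser {
    // language -> (bigram -> count)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // language -> total number of bigrams seen in training
    private final Map<String, Integer> totals = new HashMap<>();
    // every distinct bigram seen in any language, used for smoothing
    private final Set<String> vocabulary = new HashSet<>();

    /** Adds the bigrams of sample to the model for the given language. */
    void train(String language, String sample) {
        Map<String, Integer> c = counts.computeIfAbsent(language, k -> new HashMap<>());
        String s = normalize(sample);
        for (int i = 0; i + 2 <= s.length(); i++) {
            String bigram = s.substring(i, i + 2);
            c.merge(bigram, 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
            vocabulary.add(bigram);
        }
    }

    /** Returns the best-scoring trained language, or null if there is nothing to score. */
    String guess(String text) {
        String s = normalize(text);
        if (s.length() < 2) {
            return null;
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String language : counts.keySet()) {
            Map<String, Integer> c = counts.get(language);
            // add-one smoothing; the extra 1 leaves room for unseen bigrams
            double denom = totals.getOrDefault(language, 0) + vocabulary.size() + 1;
            double score = 0.0;
            for (int i = 0; i + 2 <= s.length(); i++) {
                int n = c.getOrDefault(s.substring(i, i + 2), 0);
                score += Math.log((n + 1) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = language;
            }
        }
        return best;
    }

    /** Lower-cases and collapses every run of non-letters into a single space. */
    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
    }
}
```

You would call train("en", ...) and train("de", ...) once with sample text 
for each language, then guess(query) on the user's input. Short queries give 
the model very little to go on, so treat the answer as a hint at best.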

At the end of the day, Lucene cannot help you in choosing the correct language 
as it doesn't know, and so it'll be up to you to add the necessary logic to 
tell Lucene which Analyzers to utilise. :(
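
The "necessary logic" can be as simple as a lookup from language code to 
analyzer with a fallback. A sketch only: AnalyzerChooser is a made-up helper, 
not part of Lucene, and it is generic so the example stands alone. In 
practice the type parameter would be Lucene's Analyzer and you would 
register whichever analyzers you index with:

```java
import java.util.HashMap;
import java.util.Map;

final class AnalyzerChooser<A> {
    private final Map<String, A> byLanguage = new HashMap<>();
    private final A fallback;

    /** fallback is returned when the language is unknown or not registered. */
    AnalyzerChooser(A fallback) {
        this.fallback = fallback;
    }

    /** Associates a language code (e.g. "en", "zh") with an analyzer. */
    AnalyzerChooser<A> register(String languageCode, A analyzer) {
        byLanguage.put(languageCode, analyzer);
        return this;
    }

    /** Picks the analyzer for a language chosen by the user or guessed from the input. */
    A choose(String languageCode) {
        if (languageCode == null) {
            return fallback;
        }
        return byLanguage.getOrDefault(languageCode, fallback);
    }
}
```

The language code can come from the user's option list or from a guesser; 
either way the query should be analysed the same way the matching documents 
were analysed at index time.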

Andy

[1] Churcher, G E; Hayes, J; Hughes, J S; Johnson, S; Souter, C. Bigram and 
trigram models for language identification and classification. In: Evett, L & 
Rose, T (editors), Computational Linguistics for Speech and Handwriting 
Recognition, AISB'94 Workshop, University of Leeds/AISB, 1994.

On Monday 11 Apr 2005 01:21, Eric Chow wrote:
> Hello,
>
> If I don't know the language of the input terms, how can I use
> different analyzer to search it ?
>
> For example, the input box accepts UTF-8 search text, they can be
> anything, such as Chinese, Japanese, English, Russian, Deuch, etc. How
> can search any of them or all of them with Lucene?
>
> Any example, please?
>
>
> Best Regards,
> Eric
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
