The ICU project (http://site.icu-project.org/) has analyzers for Lucene, and
they have been ported to Elasticsearch. Maybe those integrate better.
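
For example, you can wire the ICU tokenizer into an Analyzer of your own.
An untested sketch, assuming Lucene 4.x with the lucene-analyzers-icu
module on the classpath (the class name is just for illustration):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

    public class IcuAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String field, Reader reader) {
            // ICUTokenizer segments text per UAX#29, with per-script customization.
            Tokenizer source = new ICUTokenizer(reader);
            // ICUFoldingFilter applies Unicode case/accent folding to each token.
            TokenStream stream = new ICUFoldingFilter(source);
            return new TokenStreamComponents(source, stream);
        }
    }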

As for the tokenization it doesn't do, I would think an extra stage in your
analysis chain would be just the thing; see the sketch below.
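
For example, tacking a WordDelimiterFilter onto StandardTokenizer to split
tokens the stock chain leaves intact. Again an untested sketch, assuming
Lucene 4.x (the flag choice and class name are just for illustration):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class ExtraSplitAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
            // Split tokens further on intra-word delimiters and case changes.
            TokenStream stream = new WordDelimiterFilter(source,
                    WordDelimiterFilter.GENERATE_WORD_PARTS
                  | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE, null);
            stream = new LowerCaseFilter(Version.LUCENE_40, stream);
            return new TokenStreamComponents(source, stream);
        }
    }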

-Paul

> -----Original Message-----
> From: Trejkaz [mailto:[email protected]]
> Sent: Tuesday, January 08, 2013 3:44 PM
> To: [email protected]
> Subject: Re: Is StandardAnalyzer good enough for multi languages...
> 
> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <[email protected]> wrote:
> > Does Lucene StandardAnalyzer work for all the languages for tokenizing
> > before indexing (since we are using Java, I think the content is
> > converted to UTF-8 before tokenizing/indexing)?
> 
> No. There are multiple cases where it chooses not to break something
> which it should break. Some of these cases even result in undesirable
> behaviour for English, so I would be surprised if there were even a
> single language which it handles acceptably.
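
For anyone who wants to check that against their own text, dumping the
tokens is easy enough. Untested sketch, assuming Lucene 4.x (the field
name and sample text are arbitrary):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class TokenDump {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
            TokenStream ts = analyzer.tokenStream("f",
                    new StringReader("put your problem text here"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Print each token StandardAnalyzer emits, one per line.
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }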
