Hi Robert Muir, we will check on this. Thanks a lot for the pointers.

-- Kumaran R
On Mon, Jan 16, 2023 at 11:16 PM Robert Muir <rcm...@gmail.com> wrote:

> On Tue, Jan 10, 2023 at 2:04 AM Kumaran Ramasubramanian
> <kums....@gmail.com> wrote:
> >
> > For handling Indian regional languages, what is the advisable approach?
> >
> > 1. Indexing each language's data (Tamil, Hindi, etc.) in a specific field
> > like content_tamil or content_hindi, with a specific per-field analyzer,
> > like TamilAnalyzer for content_tamil and HindiAnalyzer for content_hindi?
>
> You don't need to do this just to tokenize. You only need to do this
> if you want to do something fancier on top (e.g. stemming and so on).
> If you look at newer Lucene versions, there are more analyzers for
> more languages.
>
> > 2. Indexing all language data in the same field, but handling tokenization
> > with a specific Unicode range (similar to THAI) in the tokenizer grammar,
> > as mentioned below:
> >
> > > THAI = [\u0E00-\u0E59]
> > > TAMIL = [\u0B80-\u0BFF]
> > > // basic word: a sequence of digits & letters (includes Thai to enable
> > > // ThaiAnalyzer to function)
> > > ALPHANUM = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+
>
> Don't do this: just use StandardTokenizer instead of ClassicTokenizer.
> StandardTokenizer can tokenize all the Indian writing systems
> out of the box.
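For anyone landing on this thread later: if you do want the per-field setup from
option 1 (i.e. you need stemming/stopwords on top of tokenization), a minimal
sketch using PerFieldAnalyzerWrapper looks like this. The field names
content_tamil/content_hindi are taken from the question above; note that
TamilAnalyzer only ships in newer Lucene releases, as Robert mentions.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.hi.HindiAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.ta.TamilAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;

    public class PerFieldIndicAnalyzers {
        public static IndexWriterConfig config() {
            // Language-specific analyzers are only needed for stemming,
            // stopwords, etc. -- not for plain tokenization.
            Map<String, Analyzer> perField = new HashMap<>();
            perField.put("content_hindi", new HindiAnalyzer());
            perField.put("content_tamil", new TamilAnalyzer()); // newer Lucene versions only

            // Everything not listed above falls back to the default analyzer.
            Analyzer analyzer = new PerFieldAnalyzerWrapper(
                    new StandardAnalyzer(), perField);
            return new IndexWriterConfig(analyzer);
        }
    }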
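And a minimal sketch of Robert's suggestion for option 2: StandardTokenizer
follows the Unicode UAX#29 word-break rules, so mixed Tamil/Hindi/English text
in a single field tokenizes without patching script ranges into a JFlex grammar
the way ClassicTokenizer would require. The sample string is just an
illustration, assuming a recent Lucene version.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StandardTokenizerDemo {
        public static void main(String[] args) throws IOException {
            // Tamil, Hindi, and Latin text in one string -- no per-script
            // ranges needed, unlike the ClassicTokenizer grammar above.
            String text = "தமிழ் उदाहरण example 123";
            try (StandardTokenizer tok = new StandardTokenizer()) {
                tok.setReader(new StringReader(text));
                CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
                tok.reset();
                while (tok.incrementToken()) {
                    System.out.println(term.toString()); // one token per line
                }
                tok.end();
            }
        }
    }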