On Tue, Jan 10, 2023 at 2:04 AM Kumaran Ramasubramanian <kums....@gmail.com> wrote:
>
> For handling Indian regional languages, what is the advisable approach?
>
> 1. Indexing each language's data (Tamil, Hindi, etc.) in specific fields
> such as content_tamil and content_hindi, with a specific per-field
> analyzer, e.g. TamilAnalyzer for content_tamil and HindiAnalyzer for
> content_hindi?
You don't need to do this just to tokenize. You only need to do this if
you want to do something fancier on top (e.g. stemming and so on). Newer
Lucene versions ship analyzers for more languages; a sketch of this
per-field setup follows at the end.

> 2. Indexing all language data in the same field, but handling
> tokenization with a specific Unicode range (similar to THAI) in the
> tokenizer, like this:
>
> THAI = [\u0E00-\u0E59]
> TAMIL = [\u0B80-\u0BFF]
> // basic word: a sequence of digits & letters (includes Thai to enable
> // ThaiAnalyzer to function)
> ALPHANUM = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+

Don't do this: just use StandardTokenizer instead of ClassicTokenizer.
StandardTokenizer can tokenize all the Indian writing systems out of the
box (see the second sketch below).
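If you do want the fancier per-language handling from option 1 (stemming,
language-specific stopwords), something along these lines should work. This
is a minimal sketch, not a drop-in solution: it assumes a Lucene version
recent enough to include TamilAnalyzer (8.6 or later), and the field names
content_tamil/content_hindi are taken from your question.

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.hi.HindiAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.ta.TamilAnalyzer;

public class PerLanguageFields {
    public static Analyzer build() {
        // Language-specific analyzers (stemming, stopwords) keyed by
        // field name. TamilAnalyzer was added in Lucene 8.6, so this
        // assumes a reasonably new release.
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("content_tamil", new TamilAnalyzer());
        perField.put("content_hindi", new HindiAnalyzer());
        // Fallback analyzer for every other field.
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    }
}

You would pass the returned Analyzer to IndexWriterConfig as usual; the
wrapper dispatches to the right analyzer based on the field name at both
index and query time.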
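And here is a sketch of option 2 done the recommended way: a single custom
analyzer built on StandardTokenizer, with no hand-rolled Unicode ranges.
StandardTokenizer implements UAX #29 word segmentation, which is why Tamil,
Devanagari, and the other Indian scripts work out of the box. The class
name IndicAnalyzer is just illustrative, and a recent Lucene release is
assumed; plain StandardAnalyzer gives you essentially the same thing.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class IndicAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // StandardTokenizer segments by UAX #29 rules, so no per-script
        // character classes (THAI, TAMIL, ...) are needed in a grammar.
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}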
