For handling Indian regional languages, what is the advisable approach?

1. Indexing each language's data (Tamil, Hindi, etc.) in a language-specific field such as content_tamil or content_hindi, with a per-field analyzer (e.g. a Tamil analyzer for content_tamil, HindiAnalyzer for content_hindi)?
2. Indexing all language data in the same field, but handling tokenization with a script-specific Unicode range in the tokenizer grammar (similar to how THAI is handled), like this:

    THAI     = [\u0E00-\u0E59]
    TAMIL    = [\u0B80-\u0BFF]
    // basic word: a sequence of digits & letters (includes Thai to enable
    // ThaiAnalyzer to function)
    ALPHANUM = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+

Note: I am using Lucene 4.10.4, but I am open to suggestions based on the latest Lucene versions as well as Lucene 4.

-- 
Kumaran R
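P.S. To make option 1 concrete, here is a rough sketch of the kind of routing I have in mind: detect the script of a piece of text from its Unicode block and pick the target field accordingly, so that each field can then be analyzed with its own analyzer (e.g. via PerFieldAnalyzerWrapper). The fieldFor() helper and the field names are my own illustration, not Lucene API, and it only looks at the first recognizable script in the text:

```java
// Sketch (plain JDK, no Lucene dependency): choose a per-language index
// field by inspecting the Unicode block of the text's code points.
// Field names like "content_tamil" are hypothetical conventions.
public class FieldRouter {

    static String fieldFor(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            // Tamil block: U+0B80-U+0BFF; Devanagari (Hindi): U+0900-U+097F
            if (block == Character.UnicodeBlock.TAMIL) {
                return "content_tamil";
            }
            if (block == Character.UnicodeBlock.DEVANAGARI) {
                return "content_hindi";
            }
            i += Character.charCount(cp); // advance by full code point
        }
        return "content"; // fallback field for Latin / unrecognized scripts
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("வணக்கம்"));   // Tamil text
        System.out.println(fieldFor("नमस्ते"));    // Devanagari (Hindi) text
        System.out.println(fieldFor("hello"));     // falls back to default
    }
}
```

At indexing time, each of those fields could then be mapped to its analyzer with Lucene's PerFieldAnalyzerWrapper, which takes a default analyzer plus a Map<String, Analyzer> of per-field overrides.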