On Tue, Jan 10, 2023 at 2:04 AM Kumaran Ramasubramanian <kums....@gmail.com> wrote:
>
> For handling Indian regional languages, what is the advisable approach?
>
> 1. Indexing each language's data (Tamil, Hindi, etc.) in specific fields
> such as content_tamil and content_hindi, with a specific per-field
> analyzer, e.g. TamilAnalyzer for content_tamil and HindiAnalyzer for
> content_hindi?
You don't need to do this just to tokenize. You only need to do this if
you want to do something fancier on top (e.g. stemming and so on). Newer
Lucene versions ship analyzers for more languages; a sketch of this
per-field setup follows at the end.

> 2. Indexing all language data in the same field, but handling
> tokenization with a specific Unicode range (similar to THAI) in the
> tokenizer, like this:
>
> THAI = [\u0E00-\u0E59]
> TAMIL = [\u0B80-\u0BFF]
> // basic word: a sequence of digits & letters (includes Thai to enable
> // ThaiAnalyzer to function)
> ALPHANUM = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+

Don't do this: just use StandardTokenizer instead of ClassicTokenizer.
StandardTokenizer can tokenize all the Indian writing systems out of the
box (see the second sketch below).
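If you do want the fancier per-language handling from option 1 (stemming,
language-specific stopwords), something along these lines should work. This
is a minimal sketch, not a drop-in solution: it assumes a Lucene version
recent enough to include TamilAnalyzer (8.6 or later), and the field names
content_tamil/content_hindi are taken from your question.

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.hi.HindiAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.ta.TamilAnalyzer;

public class PerLanguageFields {
    public static Analyzer build() {
        // Language-specific analyzers (stemming, stopwords) keyed by
        // field name. TamilAnalyzer was added in Lucene 8.6, so this
        // assumes a reasonably new release.
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("content_tamil", new TamilAnalyzer());
        perField.put("content_hindi", new HindiAnalyzer());
        // Fallback analyzer for every other field.
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    }
}

You would pass the returned Analyzer to IndexWriterConfig as usual; the
wrapper dispatches to the right analyzer based on the field name at both
index and query time.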
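And here is a sketch of option 2 done the recommended way: a single custom
analyzer built on StandardTokenizer, with no hand-rolled Unicode ranges.
StandardTokenizer implements UAX #29 word segmentation, which is why Tamil,
Devanagari, and the other Indian scripts work out of the box. The class
name IndicAnalyzer is just illustrative, and a recent Lucene release is
assumed; plain StandardAnalyzer gives you essentially the same thing.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class IndicAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // StandardTokenizer segments by UAX #29 rules, so no per-script
        // character classes (THAI, TAMIL, ...) are needed in a grammar.
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}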
