Hi Robert Muir, we will check on this. Thanks a lot for the pointers.

-- Kumaran R
On Mon, Jan 16, 2023 at 11:16 PM Robert Muir <rcm...@gmail.com> wrote:

> On Tue, Jan 10, 2023 at 2:04 AM Kumaran Ramasubramanian
> <kums....@gmail.com> wrote:
> >
> > For handling Indian regional languages, what is the advisable approach?
> >
> > 1. Indexing each language's data (Tamil, Hindi, etc.) in a specific field
> > like content_tamil or content_hindi, with a specific per-field analyzer,
> > like TamilAnalyzer for content_tamil and HindiAnalyzer for content_hindi?
>
> You don't need to do this just to tokenize. You only need to do this
> if you want to do something fancier on top (e.g. stemming and so on).
> If you look at newer Lucene versions, there are more analyzers for
> more languages.
>
> > 2. Indexing all language data in the same field, but handling tokenization
> > with a specific Unicode range (similar to THAI) in the tokenizer grammar,
> > as mentioned below:
> >
> > > THAI = [\u0E00-\u0E59]
> > > TAMIL = [\u0B80-\u0BFF]
> > > // basic word: a sequence of digits & letters (includes Thai to enable
> > > // ThaiAnalyzer to function)
> > > ALPHANUM = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+
>
> Don't do this: just use StandardTokenizer instead of ClassicTokenizer.
> StandardTokenizer can tokenize all the Indian writing systems
> out of the box.
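For anyone landing on this thread later: if you do want the per-field setup from
option 1 (i.e. you need stemming/stopwords on top of tokenization), a minimal
sketch using PerFieldAnalyzerWrapper looks like this. The field names
content_tamil/content_hindi are taken from the question above; note that
TamilAnalyzer only ships in newer Lucene releases, as Robert mentions.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.hi.HindiAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.ta.TamilAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;

    public class PerFieldIndicAnalyzers {
        public static IndexWriterConfig config() {
            // Language-specific analyzers are only needed for stemming,
            // stopwords, etc. -- not for plain tokenization.
            Map<String, Analyzer> perField = new HashMap<>();
            perField.put("content_hindi", new HindiAnalyzer());
            perField.put("content_tamil", new TamilAnalyzer()); // newer Lucene versions only

            // Everything not listed above falls back to the default analyzer.
            Analyzer analyzer = new PerFieldAnalyzerWrapper(
                    new StandardAnalyzer(), perField);
            return new IndexWriterConfig(analyzer);
        }
    }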
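And a minimal sketch of Robert's suggestion for option 2: StandardTokenizer
follows the Unicode UAX#29 word-break rules, so mixed Tamil/Hindi/English text
in a single field tokenizes without patching script ranges into a JFlex grammar
the way ClassicTokenizer would require. The sample string is just an
illustration, assuming a recent Lucene version.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StandardTokenizerDemo {
        public static void main(String[] args) throws IOException {
            // Tamil, Hindi, and Latin text in one string -- no per-script
            // ranges needed, unlike the ClassicTokenizer grammar above.
            String text = "தமிழ் उदाहरण example 123";
            try (StandardTokenizer tok = new StandardTokenizer()) {
                tok.setReader(new StringReader(text));
                CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
                tok.reset();
                while (tok.incrementToken()) {
                    System.out.println(term.toString()); // one token per line
                }
                tok.end();
            }
        }
    }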