For handling Indian regional languages, what is the advisable approach?

1. Indexing each language's data (Tamil, Hindi, etc.) in a language-specific field such as content_tamil or content_hindi, with a per-field analyzer (e.g. a Tamil analyzer for content_tamil, HindiAnalyzer for content_hindi)?
2. Indexing all language data in the same field, but handling tokenization with a script-specific Unicode range in the tokenizer grammar (similar to how THAI is handled), like this:

    THAI     = [\u0E00-\u0E59]
    TAMIL    = [\u0B80-\u0BFF]
    // basic word: a sequence of digits & letters (includes Thai to enable
    // ThaiAnalyzer to function)
    ALPHANUM = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+

Note: I am using Lucene 4.10.4, but I am open to suggestions based on the latest Lucene versions as well as Lucene 4.

-- 
Kumaran R
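P.S. To make option 1 concrete, here is a rough sketch of the kind of routing I have in mind: detect the script of a piece of text from its Unicode block and pick the target field accordingly, so that each field can then be analyzed with its own analyzer (e.g. via PerFieldAnalyzerWrapper). The fieldFor() helper and the field names are my own illustration, not Lucene API, and it only looks at the first recognizable script in the text:

```java
// Sketch (plain JDK, no Lucene dependency): choose a per-language index
// field by inspecting the Unicode block of the text's code points.
// Field names like "content_tamil" are hypothetical conventions.
public class FieldRouter {

    static String fieldFor(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            // Tamil block: U+0B80-U+0BFF; Devanagari (Hindi): U+0900-U+097F
            if (block == Character.UnicodeBlock.TAMIL) {
                return "content_tamil";
            }
            if (block == Character.UnicodeBlock.DEVANAGARI) {
                return "content_hindi";
            }
            i += Character.charCount(cp); // advance by full code point
        }
        return "content"; // fallback field for Latin / unrecognized scripts
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("வணக்கம்"));   // Tamil text
        System.out.println(fieldFor("नमस्ते"));    // Devanagari (Hindi) text
        System.out.println(fieldFor("hello"));     // falls back to default
    }
}
```

At indexing time, each of those fields could then be mapped to its analyzer with Lucene's PerFieldAnalyzerWrapper, which takes a default analyzer plus a Map<String, Analyzer> of per-field overrides.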