Hello, I'm currently working through some problems when searching for Tibetan
characters, specifically the range \u0F10-\u0F19. We are using the
StandardAnalyzer (3.4), and I've narrowed the problem down to
StandardTokenizerImpl throwing these characters away: in getNextToken() they
fall through to the case commented /* Not numeric, word, ideographic,
hiragana, or SE Asian -- ignore it */. So, my questions are: is this the
expected behaviour, and if so, what would be the best general way to
support code points that the StandardAnalyzer does not recognize?
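As a side note (not from the original post, but it may help explain the behaviour): StandardTokenizer's grammar keeps characters that are word-like under Unicode rules, so one quick way to see why this range is dropped is to inspect the Unicode general categories of those code points with the plain JDK. The class name below is just for illustration; it assumes the tokenizer's notion of "word character" roughly tracks the letter/digit categories, which these marks and combining signs are not.

```java
// Illustrative stdlib check: none of U+0F10..U+0F19 is classified as a
// letter or digit, which is consistent with StandardTokenizer's grammar
// treating them as ignorable. (Class name is hypothetical.)
public class TibetanRangeCheck {
    public static void main(String[] args) {
        for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
            System.out.printf("U+%04X  generalCategory=%d  letterOrDigit=%b%n",
                    cp,
                    Character.getType(cp),          // Unicode general category
                    Character.isLetterOrDigit(cp)); // false for this range
        }
    }
}
```

If the goal is to keep such code points, the usual routes in 3.x are a different tokenizer (e.g. WhitespaceTokenizer) or a custom analyzer chain, rather than patching StandardTokenizerImpl.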
