Thanks, Robert. That makes sense. Do you have a link handy where I can find this information, i.e., the word boundary/punctuation rules for any Unicode character set?
On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur <denisbrod...@gmail.com> wrote:
> > Hello, I'm currently working out some problems when searching for Tibetan
> > characters. More specifically: \u0F10-\u0F19. We are using the
>
> Unicode doesn't consider most of these characters part of a word: most
> are punctuation and symbols (except 0F18 and 0F19, which are combining
> characters that combine with digits).
>
> For example, 0F14 is a text delimiter.
>
> In general, StandardTokenizer discards punctuation and is geared at
> word boundaries, just like you would have trouble searching on
> characters like '(' in English. So I think it's totally expected.
>
> --
> lucidimagination.com
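As an aside (not from the thread itself), Robert's classification of U+0F10-U+0F19 can be checked directly against the Unicode Character Database using Python's standard `unicodedata` module. Code points whose general category starts with "P" (punctuation) or "S" (symbol) are exactly the ones a word-boundary tokenizer like StandardTokenizer will discard, while "Mn" marks are combining characters. This is an illustrative sketch, not Lucene code:

```python
import unicodedata

def categories(start: int, end: int) -> dict:
    """Map each code point in [start, end] to its Unicode general category."""
    return {cp: unicodedata.category(chr(cp)) for cp in range(start, end + 1)}

for cp, cat in categories(0x0F10, 0x0F19).items():
    kind = ("punctuation/symbol" if cat[0] in "PS"
            else "combining mark" if cat == "Mn"
            else cat)
    print(f"U+{cp:04X}  {cat}  ({kind})")
```

Running this shows U+0F14 (the text delimiter Robert mentions) as category Po, and U+0F18/U+0F19 as Mn combining marks, matching his explanation of why StandardTokenizer drops most of this range.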