On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur <[email protected]> wrote:
> Hello, I'm currently working out some problems when searching for Tibetan
> characters. More specifically: U+0F10-U+0F19. We are using the
Unicode doesn't consider most of these characters part of a word: most
are punctuation and symbols (except U+0F18 and U+0F19, which are
combining marks that attach to digits). For example, U+0F14 is a text
delimiter.
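
A quick way to verify those general categories is plain Java, no Lucene
needed (just a sketch; the class name is only for illustration):

public class TibetanCategories {
    public static void main(String[] args) {
        // Print the Unicode general category for U+0F10..U+0F19.
        for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
            String label;
            switch (Character.getType(cp)) {
                case Character.OTHER_PUNCTUATION: label = "Po (punctuation)"; break;
                case Character.OTHER_SYMBOL:      label = "So (symbol)"; break;
                case Character.NON_SPACING_MARK:  label = "Mn (combining mark)"; break;
                default:                          label = "other";
            }
            System.out.printf("U+%04X -> %s%n", cp, label);
        }
    }
}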
In general, StandardTokenizer discards punctuation and splits on word
boundaries, just as you would have trouble searching for characters
like '(' in English. So I think this is entirely expected.
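
Here's a minimal sketch of what that looks like (assuming a Lucene
3.x-era API; Version.LUCENE_35 and the class name are my own, adjust
for your release):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TibetanTokenDemo {
    public static void main(String[] args) throws Exception {
        // Tibetan digits one, two, then GTER TSHEG (U+0F14), then digit three.
        String text = "\u0F21\u0F22\u0F14\u0F23";
        TokenStream ts = new StandardTokenizer(Version.LUCENE_35, new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // Prints the two digit runs; U+0F14 never appears as a token.
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}

If you need those marks to be searchable, you'd have to tokenize
differently (a custom tokenizer, or mapping them to something else
before tokenization).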
--
lucidimagination.com
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]