Thanks Robert.  That makes sense.  Do you have a link handy where I can
find this information, i.e. the word-boundary/punctuation rules for any
Unicode character set?

On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir <rcm...@gmail.com> wrote:

> On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur <denisbrod...@gmail.com>
> wrote:
> > Hello, I'm currently working out some problems when searching for Tibetan
> > characters.  More specifically: U+0F10-U+0F19.  We are using the
>
> Unicode doesn't consider most of these characters part of a word: most
> are punctuation and symbols
> (except 0F18 and 0F19, which are combining marks that combine with
> digits).
>
> For example, 0F14 is a text delimiter.
>
> In general, StandardTokenizer discards punctuation and is geared at
> word boundaries, just as
> you would have trouble searching for characters like '(', etc. in
> English. So I think it's totally expected.
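You can verify the Unicode general categories of these code points with nothing but the JDK, since `Character.getType` reports the category from the Unicode Character Database. A minimal sketch (class name is illustrative):

```java
// Print the Unicode general category of each code point in U+0F10..U+0F19.
// Punctuation (Po) and symbols (So) are discarded by word-oriented
// tokenizers; combining marks (Mn) only attach to a preceding character.
public class TibetanMarkCategories {
    public static void main(String[] args) {
        for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
            int type = Character.getType(cp);
            String kind;
            if (type == Character.OTHER_PUNCTUATION) {
                kind = "punctuation (Po)";
            } else if (type == Character.OTHER_SYMBOL) {
                kind = "symbol (So)";
            } else if (type == Character.NON_SPACING_MARK) {
                kind = "combining mark (Mn)";
            } else {
                kind = "other (type " + type + ")";
            }
            System.out.printf("U+%04X  %s%n", cp, kind);
        }
    }
}
```

Running this shows, for instance, that U+0F14 is punctuation while U+0F18 and U+0F19 are combining marks, which matches the behavior described above.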
>
> --
> lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
