Hi all,

Has anyone had any luck using StandardTokenizer for
Unicode beyond the Latin-1 set? I have tried to use it for
Cyrillic (U+0400..U+04FF), and the characters don't seem to
get through, even though Cyrillic IS included in
StandardTokenizer.jj (i.e. it is part of the set of Unicode
characters used to define the Letter token). If I set
UNICODE_INPUT = true in StandardTokenizer.jj (and disable
USER_CHAR_STREAM = true), it starts working perfectly.

So does that mean I have to maintain my own version of
StandardTokenizer to make Unicode input possible?
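
For reference, the change amounts to roughly this in the
options block of StandardTokenizer.jj. This is only a sketch:
the other options are omitted, and the exact contents of the
block may differ between Lucene versions.

    options {
      // ... other options left unchanged ...
      UNICODE_INPUT = true;          // enabled
      // USER_CHAR_STREAM = true;    // disabled (commented out)
    }

After changing the options, the tokenizer sources have to be
regenerated with JavaCC for the change to take effect.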

Boris Okner 
