On Nov 16, 2009, at 6:43 PM, Robert Muir wrote:

> DM, in this case I'm not referring to surrogates, etc, but instead the idea
> that properties for an existing character can change (the soft hyphen and
> arabic ayah were two examples), also new characters are introduced.
>
> these will affect what analysis components (ex. tokenizers) do, because they
> like to use categories such as .isWhiteSpace, .isLetter, things like that.
>
> this means these components have different behavior, because they are
> data-driven, even though we didnt change any code.
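
To make the data-driven point concrete, here is a minimal sketch using only `java.lang.Character`. The class name `CharPropsProbe` and the helper `describe` are made up for illustration, and the two code points are my reading of the "soft hyphen" and "arabic ayah" examples (U+00AD and U+06DD). It only prints what the running JRE reports; it does not claim what any particular JRE version will return.

```java
public class CharPropsProbe {

    // Summarizes the properties the running JRE reports for one code point.
    // (Hypothetical helper, for illustration only.)
    public static String describe(int codePoint) {
        return String.format("U+%04X type=%d isLetter=%b isWhitespace=%b",
                codePoint,
                Character.getType(codePoint),       // general category constant
                Character.isLetter(codePoint),
                Character.isWhitespace(codePoint));
    }

    public static void main(String[] args) {
        // The answers come from the Unicode tables shipped with this JRE.
        System.out.println("java.version=" + System.getProperty("java.version"));
        System.out.println(describe(0x00AD)); // SOFT HYPHEN
        System.out.println(describe(0x06DD)); // ARABIC END OF AYAH
    }
}
```

Run under two different JREs, the same class file can print different lines, because the answers come from the Unicode data bundled with the JRE rather than from anything in the application.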

Then why not make ICU a dependency? At least then one has control of the delivered version. Any of us who are working with texts in non-Latin-1 languages are likely to be using ICU anyway.

-- DM