right, the only way you could really contain it would be to do something like that.
I just think we should make users aware of this, that's all. And I think it sucks that they might have to reindex twice given the current state of things (we did not complete Unicode 4 support in Lucene 3.0), which is why I mentioned this problem on the Unicode 4 issues I'm trying to work on:

2.9 -> 3.0 (to upgrade from Unicode 3 to Unicode 4-halfass)
3.0 -> 3.1 (to upgrade from Unicode 4-halfass to Unicode 4-correct) [hopefully]

BTW, I created a diff from Unicode 3's UCD to Unicode 4's UCD, in case you want to see the changes: http://people.apache.org/~rmuir/unicodeDiff.txt

On Mon, Nov 16, 2009 at 7:42 PM, DM Smith <dmsmith...@gmail.com> wrote:

>
> On Nov 16, 2009, at 6:43 PM, Robert Muir wrote:
>
> > DM, in this case I'm not referring to surrogates, etc, but instead the
> > idea that properties for an existing character can change (the soft hyphen
> > and arabic ayah were two examples), also new characters are introduced.
> >
> > these will affect what analysis components (ex. tokenizers) do, because
> > they like to use categories such as .isWhiteSpace, .isLetter, things like
> > that.
> >
> > this means these components have different behavior, because they are
> > data-driven, even though we didn't change any code.
>
> Then why not make ICU a dependency. At least then one has control of the
> delivered version. Any of us that are working with texts in non latin-1
> languages are likely to be using ICU anyway.
>
> -- DM

--
Robert Muir
rcm...@gmail.com
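
To see how a given JVM classifies the two characters mentioned in the thread (soft hyphen, U+00AD, and Arabic end of ayah, U+06DD), a minimal sketch like the one below could be run on each Java version in question. The class name UnicodePropertyProbe and the describe helper are made up for illustration and are not part of Lucene. It only prints what java.lang.Character reports, so the output depends on the Unicode data shipped with the JVM, which is exactly the data-driven behavior discussed above.

```java
public class UnicodePropertyProbe {

    // Hypothetical helper: formats the properties a tokenizer would
    // typically consult for one code point, as reported by this JVM.
    public static String describe(int codePoint) {
        return String.format(
                "U+%04X type=%d isLetter=%b isWhitespace=%b isLetterOrDigit=%b",
                codePoint,
                Character.getType(codePoint),        // general category constant
                Character.isLetter(codePoint),
                Character.isWhitespace(codePoint),
                Character.isLetterOrDigit(codePoint));
    }

    public static void main(String[] args) {
        // The JVM version determines which Unicode data is in effect.
        System.out.println("java.version=" + System.getProperty("java.version"));

        // Soft hyphen, Arabic end of ayah, and 'A' as a stable baseline.
        int[] samples = {0x00AD, 0x06DD, 0x0041};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

Comparing the output across two JVMs shows whether a tokenizer built on these methods would split the same text differently without any code change.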