On Nov 16, 2009, at 7:53 PM, Robert Muir wrote:

> right, the only way you could really contain it would be to do something like
> that.

I'm looking forward to your ICU analyzer! IMHO, it would be great to have it be a pluggable replacement for its counterparts in core. That is, using reflection, if the jar is present, then use it.

>
> I just think we should make users aware of this, thats all.

I've been reading the thread, and at first my response was "No big deal, it won't affect me" (i.e. awareness of the problem). Now my thought is "I'm hosed" (i.e. understanding).

I think we need a mechanism (I mentioned this before) to build a manifest of the parts of the tool chain that builds each field in an index. Then if any part is revised in a fashion that is not 100% backward compatible, we'd know.

As it is, I'm just going to mark each index as dirty on each upgrade to Lucene, Java, or ICU, and force a rebuild.

> and I think it sucks they might have to reindex twice with the current status
> of things (we did not complete unicode 4 support in lucene 3.0)
> which is why i mentioned this problem on the unicode 4 issues im trying to
> work.

Whether 3.0 goes out as it is now or with these fixes is up to the voters.

>
> 2.9->3.0 (to upgrade from Unicode 3 to Unicode 4-halfass)
> 3.0->3.1 (to upgrade from Unicode 4-halfass to Unicode 4-correct) [hopefully]

If this is the path, then perhaps the best advice is to skip 3.0 and take the pain once.

>
> btw, i created a diff from unicode 3's UCD to unicode 4's UCD, in case you
> want to see the changes: http://people.apache.org/~rmuir/unicodeDiff.txt

That's an amazing number of changes, even when you ignore name changes.

>
> On Mon, Nov 16, 2009 at 7:42 PM, DM Smith <dmsmith...@gmail.com> wrote:
>
> On Nov 16, 2009, at 6:43 PM, Robert Muir wrote:
>
> > DM, in this case I'm not referring to surrogates, etc, but instead the idea
> > that properties for an existing character can change (the soft hyphen and
> > arabic ayah were two examples), also new characters are introduced.
> >
> > these will affect what analysis components (ex. tokenizers) do, because
> > they like to use categories such as .isWhiteSpace, .isLetter, things like
> > that.
> >
> > this means these components have different behavior, because they are
> > data-driven, even though we didnt change any code.
>
> Then why not make ICU a dependency. At least then one has control of the
> delivered version. Any of us that are working with texts in non latin-1
> languages are likely to be using ICU anyway.
>
> -- DM
>
> --
> Robert Muir
> rcm...@gmail.com
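
To make the manifest idea above concrete, here is a minimal sketch in Java. It is only an illustration, not an existing Lucene facility: the class name `ToolchainManifest` and its methods are hypothetical, it records one manifest per index rather than per field, and the Lucene version is passed in by the caller instead of being discovered. The ICU lookup uses reflection on `com.ibm.icu.util.VersionInfo`, so the code compiles and runs whether or not the ICU jar is on the classpath.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: records which versions of the tool chain built an index.
class ToolchainManifest {

    // Collect the versions of the components that can change analysis output.
    // luceneVersion is supplied by the caller (an assumption of this sketch).
    static Map<String, String> current(String luceneVersion) {
        Map<String, String> manifest = new TreeMap<>();
        manifest.put("java", System.getProperty("java.version"));
        manifest.put("lucene", luceneVersion);
        manifest.put("icu", detectIcuVersion());
        return manifest;
    }

    // Look up ICU reflectively, so ICU stays an optional dependency.
    // Returns "absent" when the ICU jar is not on the classpath.
    static String detectIcuVersion() {
        try {
            Class<?> versionInfo = Class.forName("com.ibm.icu.util.VersionInfo");
            return String.valueOf(versionInfo.getField("ICU_VERSION").get(null));
        } catch (ReflectiveOperationException | LinkageError e) {
            return "absent";
        }
    }

    // Conservative policy: any difference at all marks the index as dirty.
    static boolean needsReindex(Map<String, String> stored, Map<String, String> now) {
        return !stored.equals(now);
    }

    public static void main(String[] args) {
        Map<String, String> stored = new TreeMap<>(current("2.9"));
        Map<String, String> now = current("3.0");
        System.out.println("stored = " + stored);
        System.out.println("now    = " + now);
        System.out.println("reindex needed: " + needsReindex(stored, now));
    }
}
```

The manifest would be stored next to the index when it is built and compared on open. Treating every difference as dirty matches the "mark each index as dirty on each upgrade" policy; a finer-grained check would need to know which revisions are actually backward compatible.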