On Mon, Nov 16, 2009 at 8:17 PM, DM Smith <dmsmith...@gmail.com> wrote:
> > thanks DM, I hope to work on it more soon...
>
> I've been reading the thread and at first my response was. No big deal, it
> won't affect me (i.e. awareness of the problem). And now my thought is "I'm
> hosed" (i.e. understanding)
>

I guess it depends on what characters/writing systems you are currently
using. As I think you know, Unicode 3.0->4.0 is a pretty tough upgrade.

> I think we need a mechanism (I mentioned this before) to build a manifest
> of the parts of the tool chain that builds each field in an index. Then if
> any part is revisioned in a fashion that is not 100% bw compat, then we'd
> know.
>
> As it is, I'm just going to mark each index as dirty on each upgrade to
> Lucene, Java or ICU. And force a rebuild.
>

For what it's worth, on an upgrade of ICU (typically a minor Unicode version
bump, at most!) I would always reindex. This is a major Unicode version
upgrade.

> > and I think it sucks they might have to reindex twice with the current
> > status of things (we did not complete unicode 4 support in lucene 3.0)
> > which is why i mentioned this problem on the unicode 4 issues im trying to
> > work.
>
> Whether 3.0 goes out as it is now or with these fixes is up to the voters.
>

The problem is that we want 3.0 to be a 'clean' release with no
deprecations. It is impossible to do that and also have Unicode 4 support in
3.0 (we will need to deprecate a few things). We couldn't do this in 2.9,
because you need JDK 1.5 or ICU to do even basic stuff like
(U)Character.isLetter(int) :)

> > 2.9->3.0 (to upgrade from Unicode 3 to Unicode 4-halfass)
> > 3.0->3.1 (to upgrade from Unicode 4-halfass to Unicode 4-correct)
> > [hopefully]
>
> If this is the path, then perhaps the best advice is to skip 3.0 and take
> the pain once.
>

I do not know if this is "the path", but you see how it's virtually
impossible to add improvements and still guarantee any backwards
compatibility for any analysis code whatsoever, if it uses any JDK functions.
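
To make the Character.isLetter(int) point concrete, here is a rough sketch (not Lucene code; the class and method names are made up for illustration). It assumes Java 5 or later and compares per-char letter detection with per-code-point detection on a string containing one supplementary character:

```java
public class CodePointLetters {
    // Counts letters by testing each UTF-16 code unit on its own.
    // The two halves of a surrogate pair are never letters individually.
    public static int countLettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    // Counts letters by testing whole code points (API added in Java 5).
    public static int countLettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                n++;
            }
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return n;
    }

    public static void main(String[] args) {
        // 'a' followed by U+10400 (DESERET CAPITAL LETTER LONG I),
        // encoded as the surrogate pair D801 DC00.
        String s = "a\uD801\uDC00";
        System.out.println(countLettersByChar(s));      // prints 1
        System.out.println(countLettersByCodePoint(s)); // prints 2
    }
}
```

Any tokenizer built on the per-char form silently drops or splits the supplementary letter, and the answers for newly assigned characters also shift with whichever Unicode version the running JDK (or ICU) ships.
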
It's not like the TokenStream API, where it's complicated yet still "under
our control". There are variables outside of Lucene. This is what makes me
frustrated trying to make progress :)

> That's an amazing number of changes, even when you ignore name changes.
>

Yeah, they added over 1,000 characters! Here is some more information in
addition to the diff: http://www.unicode.org/versions/Unicode4.0.0/

-- 
Robert Muir
rcm...@gmail.com