On Mon, Nov 16, 2009 at 8:17 PM, DM Smith <dmsmith...@gmail.com> wrote:

>
>
thanks DM, I hope to work on it more soon...


>
> I've been reading the thread and at first my response was "No big deal, it
> won't affect me" (i.e. awareness of the problem). And now my thought is "I'm
> hosed" (i.e. understanding).
>

I guess it depends on what characters/writing systems you are currently
using.
I think you know, this 3.0 -> 4.0 jump is a pretty tough upgrade for Unicode.


>
> I think we need a mechanism (I mentioned this before) to build a manifest
> of the parts of the tool chain that builds each field in an index. Then if
> any part is revisioned in a fashion that is not 100% bw compat, then we'd
> know.
>
> As it is, I'm just going to mark each index as dirty on each upgrade to
> Lucene, Java or ICU. And force a rebuild.
>

For what it's worth, on an upgrade of ICU (typically a minor Unicode version
bump, at most!) I would always reindex.
This is a major Unicode version upgrade.
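
A rough sketch of the "mark it dirty on any upgrade" idea, in case it helps.
This is only an illustration: the class and method names are made up, the
Lucene and ICU version strings are assumed to be supplied by the caller, and
how the stored manifest is persisted next to the index is left out entirely.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: records which versions of the
// tool chain built an index, so a later run can tell whether to rebuild.
class ToolchainManifest {

    // Builds a manifest for the running environment. The Java version comes
    // from the JVM; the Lucene and ICU versions are passed in by the caller
    // (assumption: the application knows which versions it bundles).
    static Map<String, String> current(String luceneVersion, String icuVersion) {
        Map<String, String> m = new TreeMap<String, String>();
        m.put("java", System.getProperty("java.version"));
        m.put("lucene", luceneVersion);
        m.put("icu", icuVersion);
        return m;
    }

    // Conservative policy: any difference at all (or no stored manifest)
    // means the index is treated as dirty and must be rebuilt.
    static boolean needsReindex(Map<String, String> stored, Map<String, String> now) {
        return !now.equals(stored);
    }
}
```

The comparison is deliberately all-or-nothing. A finer-grained per-field
manifest, as DM suggests, would need each analysis component to report its
own version, which this sketch does not attempt.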


>
> and I think it sucks they might have to reindex twice with the current
> status of things (we did not complete Unicode 4 support in Lucene 3.0),
> which is why I mentioned this problem on the Unicode 4 issues I'm trying to
> work.
>
>
> Whether 3.0 goes out as it is now or with these fixes is up to the voters.
>

The problem is that we want 3.0 to be a 'clean' release with no
deprecations.
It is impossible to do so and also have Unicode 4 support in 3.0 (we will
need to deprecate a few things).
We couldn't do this in 2.9, because you need JDK 1.5 or ICU to do even basic
stuff like (U)Character.isLetter(int) :)
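
To make the isLetter(int) point concrete, here is a minimal sketch using only
plain JDK 1.5+ calls. The class and method names are invented for
illustration and are not Lucene code. It shows why char-at-a-time code cannot
see a supplementary character such as U+10400 (DESERET CAPITAL LETTER LONG I):
each half of the surrogate pair, taken alone, is not a letter.

```java
// Illustrative only: compares char-based and code-point-based letter tests.
class CodePointDemo {

    // Pre-1.5 style: looks at one UTF-16 code unit at a time, so the two
    // surrogates that encode a supplementary character are tested separately.
    static int countLettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    // 1.5+ style: decodes whole code points and advances by their width.
    static int countLettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                n++;
            }
            i += Character.charCount(cp);
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "a\uD801\uDC00"; // 'a' followed by U+10400
        System.out.println(countLettersByChar(s));      // prints 1
        System.out.println(countLettersByCodePoint(s)); // prints 2
    }
}
```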


>
>
> 2.9->3.0 (to upgrade from Unicode 3 to Unicode 4-halfass)
> 3.0->3.1 (to upgrade from Unicode 4-halfass to Unicode 4-correct)
> [hopefully]
>
>
> If this is the path, then perhaps the best advice is to skip 3.0 and take
> the pain once.
>

I do not know if this is "the path", but you see how it's virtually
impossible to add improvements and still guarantee any backwards
compatibility with any analysis stuff whatsoever, if it uses any JDK
functions.
It's not like the TokenStream API, where it's complicated yet still "under
our control". There are variables outside of Lucene. This is what makes me
frustrated trying to make progress :)


>
> That's an amazing number of changes, even when you ignore name changes.
>

Yeah, they added over 1,000 characters!
Here is some more information in addition to the diff:
http://www.unicode.org/versions/Unicode4.0.0/


-- 
Robert Muir
rcm...@gmail.com
