right, the only way you could really contain it would be to do something
like that.

I just think we should make users aware of this, that's all.
And I think it sucks they might have to reindex twice with the current
status of things (we did not complete Unicode 4 support in Lucene 3.0),
which is why I mentioned this problem on the Unicode 4 issues I'm trying to
work on.

2.9->3.0 (to upgrade from Unicode 3 to Unicode 4-halfass)
3.0->3.1 (to upgrade from Unicode 4-halfass to Unicode 4-correct)
[hopefully]

btw, I created a diff from Unicode 3's UCD to Unicode 4's UCD, in case you
want to see the changes: http://people.apache.org/~rmuir/unicodeDiff.txt
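
To make the "data-driven" point from the quoted message below concrete, here is
a rough sketch. It is not Lucene code: the class name UnicodeProbe and both
helper methods are made up for illustration, and it assumes a JVM new enough to
have the int code point overloads of the Character methods. It only asks the
running JVM how it classifies a character, and runs a toy letter tokenizer that
depends on those answers. The same class can report different properties on a
JVM that ships different Unicode tables, with no change to the code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Lucene).
class UnicodeProbe {

    // Splits text into runs of letters. What counts as a "letter" is decided
    // entirely by the JVM's Unicode tables, not by anything in this method.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Reports how the running JVM classifies a single code point.
    static String describe(int cp) {
        return String.format("U+%04X type=%d letter=%b whitespace=%b",
                cp, Character.getType(cp),
                Character.isLetter(cp), Character.isWhitespace(cp));
    }

    public static void main(String[] args) {
        // The two characters mentioned in the quoted message:
        // soft hyphen and Arabic end of ayah.
        System.out.println(describe(0x00AD));
        System.out.println(describe(0x06DD));
        // A word with an embedded soft hyphen; where it splits depends on
        // how the JVM classifies U+00AD.
        System.out.println(letterTokens("co\u00ADoperate"));
    }
}
```

Running main on two JVMs with different Unicode versions and comparing the
output is a cheap way to see whether an index built on one would tokenize
differently on the other.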

On Mon, Nov 16, 2009 at 7:42 PM, DM Smith <dmsmith...@gmail.com> wrote:

>
> On Nov 16, 2009, at 6:43 PM, Robert Muir wrote:
>
> > DM, in this case I'm not referring to surrogates, etc, but instead the
> > idea that properties for an existing character can change (the soft hyphen
> > and arabic ayah were two examples), also new characters are introduced.
> >
> > these will affect what analysis components (ex. tokenizers) do, because
> > they like to use categories such as .isWhiteSpace, .isLetter, things like
> > that.
> >
> > this means these components have different behavior, because they are
> > data-driven, even though we didnt change any code.
>
> Then why not make ICU a dependency. At least then one has control of the
> delivered version. Any of us that are working with texts in non latin-1
> languages are likely to be using ICU anyway.
>
> -- DM


-- 
Robert Muir
rcm...@gmail.com
