On 11/29/2010 03:43 PM, Earwin Burrfoot wrote:
On Mon, Nov 29, 2010 at 20:51, DM Smith<dmsmith...@gmail.com> wrote:
The other thing I'd like is for the spec to be saved alongside the index
as a manifest. From earlier threads, I can see that there might need to be
one for writing and another for reading. I'm not interested in using it to
construct an analyzer, but to determine whether the index is invalid with
respect to the analyzer currently in use.
You can already implement such behaviour with the 3.x branch of Lucene.
It has an IW.commit(Map<String, String> userdata) method that allows you
to commit with an arbitrary payload, which binds to the segment and can be
read back later.
Cool. I forgot entirely about that.
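A rough sketch of how I would use that for the manifest check, in plain Java so it stands on its own. The class name, the key names and the version strings are my own invention, not Lucene API. The only Lucene call assumed is the commit method mentioned above, and it appears only in comments.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: describes the analyzer setup as commit user data. */
final class AnalyzerManifest {
    // Key names are assumptions made for this sketch.
    static final String KEY_ANALYZER = "analyzer.class";
    static final String KEY_VERSION = "analyzer.matchVersion";

    /** Builds the map that would be handed to IW.commit(Map<String, String>). */
    static Map<String, String> describe(String analyzerClass, String matchVersion) {
        Map<String, String> manifest = new TreeMap<String, String>();
        manifest.put(KEY_ANALYZER, analyzerClass);
        manifest.put(KEY_VERSION, matchVersion);
        return manifest;
    }

    /**
     * Returns true only if every entry describing the current analyzer is
     * present, with the same value, in the user data read back from the index.
     * A missing manifest (null) is treated as "cannot be trusted".
     */
    static boolean matches(Map<String, String> stored, Map<String, String> current) {
        if (stored == null) {
            return false;
        }
        for (Map.Entry<String, String> entry : current.entrySet()) {
            if (!entry.getValue().equals(stored.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

At write time the application would pass describe(...) to the commit call. At open time it would read the user data back from the commit (the exact reader call depends on the Lucene version) and refuse to search, or trigger a rebuild, when matches(...) returns false.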
I think there is a problem with deprecating and removing constants too.
In trunk, which will be 4.0, it needs to be able to read and/or upgrade 2.x
indexes. From an analyzer perspective, an index is invalid if the analyzer
would produce a different token stream for the same input. If the 2.x
version constants are gone, then the index built with 2.x version
constants is no longer valid. (It might be valid, but how can one have any
confidence of that?) Upgrading the index to the new internal format
cannot change this. A buggy lowercase Turkish word will still be buggy
after upgrade. (This is a 3.0 version constant that in 5.0 will still need to
be around).
I think it was declared that Lucene does not provide index
compatibility across more than a single major revision.
Thus, we don't guarantee reading 2.x index with 4.0 Lucene. So, we can
drop 2.x constants and compatibility.
But we still have to support 3.x. Then, in version 5.0, we drop the 3.x
constants and support for the bugs and deprecated features of 3.x.
Yes, you are correct that 4.0 may read 2.x but is not guaranteed to. My
bad, yet again. I went back to the threads regarding this around May 25,
and it was also decided that 4.x might not be able to read 3.x, but will
provide a migration tool in such a case.
That said, my point still stands. The 3.0 version constant which is used
by an analyzer to preserve 3.0 behavior will need to be retained for the
sake of analyzers in 5.0. Or the index will need to be rebuilt from
original input. (I'm referencing the 3.0 rather than a 2.x because of
the example I have in mind)
A 3.0 index that is migrated to a 4.0 index still contains tokens
produced by an analyzer that was buggy. Example: a Turkish index with the
wrong lower case i. (Prior to LUCENE-2101, it would lowercase to i.
After: İ (dotted capital I) => i ("regular" lower case i) and I
("regular" upper case I) => ı (dotless lower case i).) These letters
occur very commonly in Turkish text. So the 4.0 index, still using the
3.0 version constant to get the expected behavior, works as it always did.
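To make the role of the constant concrete, here is a toy sketch in plain Java. It is not Lucene's filter code: the MatchVersion enum and its constant names are hypothetical stand-ins for the real version constants, and the "fixed" path simply borrows the JDK's Turkish locale rules. The point is only that the old behavior stays selectable as long as the old constant exists.

```java
import java.util.Locale;

/** Toy stand-in for a version-gated lowercasing step (not Lucene code). */
final class VersionedTurkishLowercase {
    /** Hypothetical version constants; declaration order is release order. */
    enum MatchVersion {
        V30, LATER;

        boolean onOrAfter(MatchVersion other) {
            return compareTo(other) >= 0;
        }
    }

    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String lowercase(String token, MatchVersion version) {
        if (version.onOrAfter(MatchVersion.LATER)) {
            // Fixed behavior: Turkish rules, so I => dotless i (U+0131).
            return token.toLowerCase(TURKISH);
        }
        // Preserved old behavior: locale-insensitive, so I => regular i.
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            sb.append(Character.toLowerCase(token.charAt(i)));
        }
        return sb.toString();
    }
}
```

An index written under V30 only stays searchable if queries keep going through the V30 branch. Remove that branch and the only remaining option is the LATER output, which produces different terms for the same input.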
Now in 5.0, there might be a migration tool, or it will be able to read a
4.x index. If the 3.0 constant is gone, none of these tokens are
reachable: search requests will have the correct lower case i and will
not be able to find those with the wrong one. It will be very obvious.
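A small self-contained illustration of that failure, again in plain Java rather than Lucene. The set of strings stands in for the term dictionary, and the class and method names are made up for the sketch.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model: terms indexed with old lowercasing, queried with new rules. */
final class StaleTokenDemo {
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    /** Old, locale-insensitive lowercasing: upper case I becomes regular i. */
    static String oldLowercase(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    /** Builds the "index" the way the old analyzer would have. */
    static Set<String> buildIndex(String... words) {
        Set<String> terms = new HashSet<String>();
        for (String word : words) {
            terms.add(oldLowercase(word));
        }
        return terms;
    }

    /** Looks up a query term, analyzed with either the old or the new rules. */
    static boolean search(Set<String> index, String query, boolean useOldRules) {
        String term = useOldRules ? oldLowercase(query) : query.toLowerCase(TURKISH);
        return index.contains(term);
    }

    public static void main(String[] args) {
        Set<String> index = buildIndex("ISI");
        System.out.println("old rules find it: " + search(index, "ISI", true));
        System.out.println("new rules find it: " + search(index, "ISI", false));
    }
}
```

With the old rules on both sides the lookup succeeds. With the new rules on the query side only, the term is analyzed with a dotless i and the stored token is never matched, even though the index itself is perfectly readable.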
Regarding this analyzer, code that uses a 2.x version constant for this
analyzer will need to change to a 3.0 version constant in order for the
index to be usable in the 4.x series if the 2.x constants are removed.
I don't think this is an isolated example.
With what's happening, every application whose index uses a deprecated
version constant will have one very long major release cycle in which to
rebuild that index from scratch.
And as I said at the bottom of my last email, I'm going to re-index
because I am able and because I want correct behavior. So whatever is
decided won't affect my application of Lucene.
-- DM