Guoqiang,

This often happens when the field data was not written to the index
correctly for some reason.

Try building a new index from your source data. That alone may fix it
if you had some kind of abnormal file I/O problem during indexing.

That said, the problem is more likely related to Unicode surrogate pairs.


To explain in more detail:

What the Error Message Means
----------------------------

In the field data file, the length of a variable-length field (like a
string) precedes the data itself. This tells Lucene how much data to
read. If the index is corrupt and that length is missing or incorrect,
the stream position becomes misaligned as Lucene reads, and it
eventually attempts to read past the end of the file.
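
To make that concrete, here is a minimal Java sketch of the failure
mode. It is not Lucene's actual file format or API: the class
LengthPrefixedDemo, its methods, and the fixed 4-byte length prefix are
all made up for illustration. It only shows how one wrong length value
throws off every read that follows it.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; NOT Lucene's real field data format.
class LengthPrefixedDemo {

    // Write each string as a 4-byte big-endian length followed by its UTF-8 bytes.
    static byte[] write(String... values) {
        byte[][] encoded = new byte[values.length][];
        int size = 0;
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            size += 4 + encoded[i].length;
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        for (byte[] bytes : encoded) {
            out.putInt(bytes.length);
            out.put(bytes);
        }
        return out.array();
    }

    // Read `count` strings back. If a length prefix claims more bytes than
    // remain in the buffer, ByteBuffer.get throws BufferUnderflowException.
    static String[] read(byte[] data, int count) {
        ByteBuffer in = ByteBuffer.wrap(data);
        String[] values = new String[count];
        for (int i = 0; i < count; i++) {
            int length = in.getInt();
            byte[] bytes = new byte[length];
            in.get(bytes);
            values[i] = new String(bytes, StandardCharsets.UTF_8);
        }
        return values;
    }
}
```

With data = write("title", "body"), read(data, 2) returns both strings.
If you then set data[3] = 7, so the first length claims 7 bytes instead
of 5, the second length is read from the wrong offset and read(data, 2)
throws BufferUnderflowException, the in-memory analogue of reading past
the end of the file.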


How the Index Gets Corrupt
--------------------------

Lucene stores strings in UTF-8 and uses a custom encoder and decoder
to do so. If your content contains Unicode characters that the Lucene
encoder can't handle, the field data could be written incorrectly.
Specifically, around the 2.9.X builds there was much debate about
cross-platform handling of Unicode surrogates and of special code
points such as U+FFFF, and its special treatment in Java. One of the
concerns raised in those discussions for Java Lucene was that this may
cause issues when porting to other platforms.

I do not recall the details of the final outcome of that debate, but I
imagine that the index corruption you're experiencing is related to
the presence of Unicode code points that fall into this problematic
range.

You may want to pre-filter your text content to remove such code
points before storage, or apply some kind of escaping or
encoding/decoding before the text goes into the index. Working around
this may also require a custom analyzer.
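
As a rough sketch of the pre-filtering idea, in Java: the class
CodePointFilter and its methods are hypothetical names, not part of
Lucene, and I am assuming the troublesome code points are unpaired
surrogates and the Unicode noncharacters (U+FFFE, U+FFFF and their
counterparts in other planes, plus U+FDD0..U+FDEF). It replaces them
with U+FFFD; dropping them instead would work the same way.

```java
// Hypothetical helper; the set of "problematic" code points is an assumption.
class CodePointFilter {

    // Return a copy of `text` with unpaired surrogates and noncharacters
    // replaced by U+FFFD (the Unicode replacement character).
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as its own value (0xD800..0xDFFF).
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isProblematic(cp)) {
                out.append('\uFFFD');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    static boolean isProblematic(int cp) {
        boolean loneSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
        // Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
        boolean nonCharacter = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
        return loneSurrogate || nonCharacter;
    }
}
```

You would call sanitize() on each field value before adding it to the
document. Valid surrogate pairs pass through untouched, so ordinary
supplementary characters are preserved.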

See these issues in Lucene Java:
https://issues.apache.org/jira/browse/LUCENE-2016
https://issues.apache.org/jira/browse/LUCENE-2126
https://issues.apache.org/jira/browse/LUCENE-2019

Also see:
http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Special_code_points
http://www.ibm.com/developerworks/java/library/j-unicode/index.html



Thanks,
Troy


On Nov 6, 2010 2:36 AM, "吴国强" <[email protected]> wrote:
