Guoqiang,

This often happens when the field data was not written to the index correctly for some reason.

Try building a new index from your source data. This may just work if you had some kind of abnormal file I/O issue during indexing. That said, the problem is more likely related to Unicode surrogate pairs. To explain in more detail:

What the Error Message Means
----------------------------

In the field data file, the length of a variable-length field (like a string) precedes the data itself. This tells Lucene how much data to read. If the index is corrupt and that length is missing or incorrect, the stream position becomes misaligned as Lucene reads the data, and it ends up attempting to read past the end of the file.

How the Index Gets Corrupt
--------------------------

Lucene stores strings in UTF-8 and uses a custom encoder and decoder to do so. If your content contains Unicode characters that the Lucene encoder can't handle, the field data could be written incorrectly.

Specifically, around the 2.9.X builds there was much debate about cross-platform handling of Unicode surrogates, particularly with regard to U+FFFF and its special treatment in Java. One of the concerns raised in those discussions for Java Lucene was that this may cause issues when porting to other platforms.

I do not recall the details of the final outcome of that debate, but I imagine that the index corruption you're experiencing is related to the presence of Unicode code points that fall into this problematic range. You may consider pre-filtering your text content to remove such code points before storage, or using some kind of escaping or encoding/decoding step prior to index storage. This may also require a custom analyzer as a workaround.
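To illustrate the length-prefix point above, here is a minimal sketch in Python. It is not Lucene's actual reader or file format code; the function names and the two sample field values are made up for the example. It only assumes a variable-length integer (7 data bits per byte, high bit meaning "more bytes follow") written in front of each UTF-8 string, and shows how one wrong length throws off every read that comes after it.

```python
import io


def write_vint(out, value):
    """Write a non-negative int, 7 bits per byte; high bit set = more bytes follow."""
    while value > 0x7F:
        out.write(bytes([(value & 0x7F) | 0x80]))
        value >>= 7
    out.write(bytes([value]))


def read_vint(stream):
    """Read an int written by write_vint, failing if the data runs out."""
    result, shift = 0, 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("read past EOF")
        result |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return result
        shift += 7


def write_string(out, text, declared_length=None):
    """Write length prefix + UTF-8 bytes. declared_length (hypothetical
    parameter) lets the demo simulate a writer that records the wrong length."""
    data = text.encode("utf-8")
    write_vint(out, len(data) if declared_length is None else declared_length)
    out.write(data)


def read_string(stream):
    """Read one length-prefixed string, trusting the stored length."""
    length = read_vint(stream)
    data = stream.read(length)
    if len(data) < length:
        raise EOFError("read past EOF")
    return data.decode("utf-8")


def read_all(raw, count):
    """Read `count` consecutive strings from a bytes buffer."""
    stream = io.BytesIO(raw)
    return [read_string(stream) for _ in range(count)]


if __name__ == "__main__":
    bad = io.BytesIO()
    write_string(bad, "title", declared_length=7)  # real length is 5
    write_string(bad, "body")
    try:
        read_all(bad.getvalue(), 2)
    except EOFError as err:
        print("second field failed:", err)
```

In the bad case the first read swallows two bytes that belong to the next field, so the reader then interprets a letter of "body" as a length and runs off the end of the data. The failure surfaces on a later field than the one that was actually written incorrectly, which is why the error message itself says little about the cause.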
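The pre-filtering idea could look something like the sketch below, again in Python and under some assumptions of mine: the names is_problematic and clean_for_indexing are made up, and the set of code points treated as problematic (unpaired surrogates plus the Unicode noncharacters, which include U+FFFF) is my guess at a safe range rather than a list taken from Lucene. By default it substitutes U+FFFD, the Unicode replacement character; pass an empty string to drop the code points instead.

```python
def is_problematic(code_point):
    """Return True for code points assumed unsafe to index (an assumption,
    not a list taken from Lucene)."""
    # Surrogate code points. A Python 3 str holds whole code points, so any
    # surrogate found in one is unpaired.
    if 0xD800 <= code_point <= 0xDFFF:
        return True
    # Noncharacters U+FDD0..U+FDEF.
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # Noncharacters ending in FFFE or FFFF in every plane (U+FFFF included).
    return (code_point & 0xFFFE) == 0xFFFE


def clean_for_indexing(text, replacement="\uFFFD"):
    """Replace problematic code points before handing text to the indexer."""
    return "".join(
        replacement if is_problematic(ord(ch)) else ch for ch in text
    )


if __name__ == "__main__":
    dirty = "name\uFFFF value"
    print(repr(clean_for_indexing(dirty)))
    print(repr(clean_for_indexing(dirty, replacement="")))
```

In a language whose strings are UTF-16 code units (Java, .NET), the same check would also need to tell a valid high/low surrogate pair from an unpaired one, since valid pairs must be left alone. Ordinary non-ASCII text, such as Chinese, passes through unchanged.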
See these issues in Lucene Java:

https://issues.apache.org/jira/browse/LUCENE-2016
https://issues.apache.org/jira/browse/LUCENE-2126
https://issues.apache.org/jira/browse/LUCENE-2019

Also see:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Special_code_points
http://www.ibm.com/developerworks/java/library/j-unicode/index.html

Thanks,
Troy

On Nov 6, 2010 2:36 AM, "吴国强" <[email protected]> wrote:
