Hmmm, not sure, but in looking at DocumentsWriter, it seems like lines around 553 might be at issue:
if (tvx != null) {
  tvx.writeLong(tvd.getFilePointer());
  if (numVectorFields > 0) {
    tvd.writeVInt(numVectorFields);
    for(int i=0;i<numVectorFields;i++)
      tvd.writeVInt(vectorFieldNumbers[i]);
    assert 0 == vectorFieldPointers[0];
    tvd.writeVLong(tvf.getFilePointer());
    long lastPos = vectorFieldPointers[0];
    for(int i=1;i<numVectorFields;i++) {
      long pos = vectorFieldPointers[i];
      tvd.writeVLong(pos-lastPos);
      lastPos = pos;
    }
    tvfLocal.writeTo(tvf);
    tvfLocal.reset();
  }
}

Specifically, the exception seems to be thrown while reading the vInt that holds the number of fields that have vectors. However, DocumentsWriter only writes that vInt when numVectorFields is > 0.

I think you might try:
if (numVectorFields > 0) {
  // ... existing code ...
} else {
  tvd.writeVInt(0);
}
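
In context, the patched section of DocumentsWriter would look roughly like this (just a sketch of the suggestion above against the snippet quoted earlier; I haven't tested it):

if (tvx != null) {
  tvx.writeLong(tvd.getFilePointer());
  if (numVectorFields > 0) {
    tvd.writeVInt(numVectorFields);
    for(int i=0;i<numVectorFields;i++)
      tvd.writeVInt(vectorFieldNumbers[i]);
    assert 0 == vectorFieldPointers[0];
    tvd.writeVLong(tvf.getFilePointer());
    long lastPos = vectorFieldPointers[0];
    for(int i=1;i<numVectorFields;i++) {
      long pos = vectorFieldPointers[i];
      tvd.writeVLong(pos-lastPos);
      lastPos = pos;
    }
    tvfLocal.writeTo(tvf);
    tvfLocal.reset();
  } else {
    // always record a field count, even for documents with no vectors,
    // so the reader finds the vInt it expects
    tvd.writeVInt(0);
  }
}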

In the old TermVectorsWriter, it used to be:
 private void writeDoc() throws IOException {
    if (isFieldOpen())
      throw new IllegalStateException("Field is still open while writing document");
    //System.out.println("Writing doc pointer: " + currentDocPointer);
    // write document index record
    tvx.writeLong(currentDocPointer);

    // write document data record
    final int size = fields.size();

    // write the number of fields
    tvd.writeVInt(size);

    // write field numbers
    for (int i = 0; i < size; i++) {
      TVField field = (TVField) fields.elementAt(i);
      tvd.writeVInt(field.number);
    }

http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup



On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:


On Fri, 28 Sep 2007, Andi Vajda wrote:

I found a bug with indexing documents that contain fields with term vectors. The indexing fails with 'read past EOF' errors in what seems to be the index-optimizing phase during addIndexes(). (I index first into a RAMDirectory, then addIndexes() into an FSDirectory.)
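
The flow is roughly this (a minimal sketch of the pattern just described; the index path, document text, field name, and term-vector options are placeholders for what my application actually uses):

    String indexPath = "/tmp/test.index";       // placeholder
    String text = "some document text";         // placeholder

    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);

    Document doc = new Document();
    doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED,
                      Field.TermVector.WITH_POSITIONS_OFFSETS));
    ramWriter.addDocument(doc);
    ramWriter.close();

    IndexWriter fsWriter = new IndexWriter(FSDirectory.getDirectory(indexPath),
                                           new StandardAnalyzer(), true);
    fsWriter.addIndexes(new Directory[] { ramDir });   // fails here, in the optimize/merge phase
    fsWriter.close();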

I have not yet formally filed the bug, as I still need to isolate the code. If I turn off indexing with term vectors, indexing completes fine.

I tried all morning to isolate the problem, but I seem unable to reproduce it in a simple unit test. In my application, I've been able to get errors by doing even less: just creating an FSDirectory and adding documents with term-vector fields fails with the error below when optimizing the index. I even tried adding the same documents, in the same order, in the unit test, but to no avail. It just works.

What is different about my environment? Well, I'm running PyLucene, but the new one, the one using Apple's Java VM, the same VM I'm using to run the unit test. And I'm not doing anything special like calling back into Python; I'm just calling regular Lucene APIs, adding documents into an IndexWriter on an FSDirectory using a StandardAnalyzer. If I stop using term vectors, everything works fine.

I'd like to get to the bottom of this but could use some help. Does the stack trace below ring a bell? Is there a way to run the whole indexing and optimizing in a single thread?
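
For instance, I'm wondering whether something like this would force all merging back into the calling thread; just a guess based on the MergeScheduler API that the ConcurrentMergeScheduler in the trace comes from (indexPath is a placeholder):

    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(indexPath),
                                         new StandardAnalyzer(), true);
    // merges would then run synchronously, in whatever thread triggers them,
    // instead of in the background ConcurrentMergeScheduler threads
    writer.setMergeScheduler(new SerialMergeScheduler());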

Thanks!

Andi..

Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)

java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)



------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


