Hmmm, not sure, but in looking at DocumentsWriter, it seems like lines around 553 might be at issue:
if (tvx != null) {
  tvx.writeLong(tvd.getFilePointer());
  if (numVectorFields > 0) {
    tvd.writeVInt(numVectorFields);
    for(int i=0;i<numVectorFields;i++)
      tvd.writeVInt(vectorFieldNumbers[i]);
    assert 0 == vectorFieldPointers[0];
    tvd.writeVLong(tvf.getFilePointer());
    long lastPos = vectorFieldPointers[0];
    for(int i=1;i<numVectorFields;i++) {
      long pos = vectorFieldPointers[i];
      tvd.writeVLong(pos-lastPos);
      lastPos = pos;
    }
    tvfLocal.writeTo(tvf);
    tvfLocal.reset();
  }
}

Specifically, the exception seems to be thrown while reading the vInt that holds the number of fields that have vectors. However, DocumentsWriter only writes that vInt when numVectorFields is > 0.

I think you might try:
if (numVectorFields > 0) {
  // ... existing code ...
} else {
  tvd.writeVInt(0);
}
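
In context, the patched section of DocumentsWriter would look roughly like this (just a sketch of the suggestion above against the snippet quoted earlier; I haven't tested it):

if (tvx != null) {
  tvx.writeLong(tvd.getFilePointer());
  if (numVectorFields > 0) {
    tvd.writeVInt(numVectorFields);
    for(int i=0;i<numVectorFields;i++)
      tvd.writeVInt(vectorFieldNumbers[i]);
    assert 0 == vectorFieldPointers[0];
    tvd.writeVLong(tvf.getFilePointer());
    long lastPos = vectorFieldPointers[0];
    for(int i=1;i<numVectorFields;i++) {
      long pos = vectorFieldPointers[i];
      tvd.writeVLong(pos-lastPos);
      lastPos = pos;
    }
    tvfLocal.writeTo(tvf);
    tvfLocal.reset();
  } else {
    // always record a field count, even for documents with no vectors,
    // so the reader finds the vInt it expects
    tvd.writeVInt(0);
  }
}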

In the old TermVectorsWriter, it used to be:
 private void writeDoc() throws IOException {
    if (isFieldOpen())
      throw new IllegalStateException("Field is still open while writing document");
    //System.out.println("Writing doc pointer: " + currentDocPointer);
    // write document index record
    tvx.writeLong(currentDocPointer);

    // write document data record
    final int size = fields.size();

    // write the number of fields
    tvd.writeVInt(size);

    // write field numbers
    for (int i = 0; i < size; i++) {
      TVField field = (TVField) fields.elementAt(i);
      tvd.writeVInt(field.number);
    }

http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup



On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:


On Fri, 28 Sep 2007, Andi Vajda wrote:

I found a bug with indexing documents that contain fields with term vectors. The indexing fails with 'read past EOF' errors in what seems to be the index-optimizing phase during addIndexes(). (I index first into a RAMDirectory, then addIndexes() into an FSDirectory.)
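
The flow is roughly this (a minimal sketch of the pattern just described; the index path, document text, field name, and term-vector options are placeholders for what my application actually uses):

    String indexPath = "/tmp/test.index";       // placeholder
    String text = "some document text";         // placeholder

    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);

    Document doc = new Document();
    doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED,
                      Field.TermVector.WITH_POSITIONS_OFFSETS));
    ramWriter.addDocument(doc);
    ramWriter.close();

    IndexWriter fsWriter = new IndexWriter(FSDirectory.getDirectory(indexPath),
                                           new StandardAnalyzer(), true);
    fsWriter.addIndexes(new Directory[] { ramDir });   // fails here, in the optimize/merge phase
    fsWriter.close();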

I have not yet formally filed the bug, as I still need to isolate the code. If I turn off indexing with term vectors, indexing completes fine.

I tried all morning to isolate the problem, but I seem unable to reproduce it in a simple unit test. In my application, I've been able to get errors by doing even less: just creating an FSDirectory and adding documents with term-vector fields fails with the error below when optimizing the index. I even tried adding the same documents, in the same order, in the unit test, but to no avail. It just works.

What is different about my environment? Well, I'm running PyLucene, but the new one, the one using Apple's Java VM, the same VM I'm using to run the unit test. And I'm not doing anything special like calling back into Python; I'm just calling regular Lucene APIs, adding documents into an IndexWriter on an FSDirectory using a StandardAnalyzer. If I stop using term vectors, everything works fine.

I'd like to get to the bottom of this but could use some help. Does the stack trace below ring a bell? Is there a way to run the whole indexing and optimizing in a single thread?
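
For instance, I'm wondering whether something like this would force all merging back into the calling thread; just a guess based on the MergeScheduler API that the ConcurrentMergeScheduler in the trace comes from (indexPath is a placeholder):

    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(indexPath),
                                         new StandardAnalyzer(), true);
    // merges would then run synchronously, in whatever thread triggers them,
    // instead of in the background ConcurrentMergeScheduler threads
    writer.setMergeScheduler(new SerialMergeScheduler());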

Thanks!

Andi..

Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)

java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)



------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


