You are right, Grant -- good catch! I have a unit test showing it now. Thank you :)
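In case it helps make the failure mode concrete, here is its shape as a tiny standalone sketch (my own class and method names, and a simplified record layout -- not the real tvd/tvf format): the writer only emits the per-doc field-count vInt when a doc actually has vector fields, while the reader always expects one count per doc, so the reader falls out of sync and eventually reads past EOF:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class VIntSkipDemo {

    // Minimal VInt codec, same encoding idea as Lucene's writeVInt/readVInt:
    // 7 bits per byte, high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    static int readVInt(ByteArrayInputStream in) throws IOException {
        int b = in.read();
        if (b == -1) throw new EOFException("read past EOF");
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            if (b == -1) throw new EOFException("read past EOF");
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // One record per doc: a field count followed by that many field numbers.
    // With writeZeroCounts == false this mimics the buggy path: docs with no
    // vector fields write nothing at all.
    static byte[] writeDocs(int[][] docs, boolean writeZeroCounts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] fields : docs) {
            if (fields.length > 0) {
                writeVInt(out, fields.length);
                for (int f : fields)
                    writeVInt(out, f);
            } else if (writeZeroCounts) {
                writeVInt(out, 0);  // the fix: always write the count
            }
        }
        return out.toByteArray();
    }

    // The reader always expects a count per doc, as TermVectorsReader does.
    static void readDocs(byte[] data, int numDocs) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        for (int d = 0; d < numDocs; d++) {
            int numFields = readVInt(in);
            for (int f = 0; f < numFields; f++)
                readVInt(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // doc 0 has two vector fields, doc 1 has none, doc 2 has one
        int[][] docs = { {1, 4}, {}, {7} };

        readDocs(writeDocs(docs, true), 3);       // fixed stream reads cleanly
        try {
            readDocs(writeDocs(docs, false), 3);  // buggy stream
        } catch (EOFException e) {
            System.out.println("buggy stream: " + e.getMessage());
        }
    }
}
```

Grant's suggested else branch (tvd.writeVInt(0)) corresponds exactly to the writeZeroCounts == true case here.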
So, this case is tickled if you have a doc (or docs) with some fields that have term vectors enabled, but then later, as part of the same buffered set of docs, you have one or more docs that have no fields with term vectors enabled. I'll fix it.

The thing is, from Andi's description I'm not sure this is the case he's hitting? He said all docs have 5 fields, one of them with term vectors enabled ... hmmm.

Mike

On Sat, 29 Sep 2007 07:59:13 -0400, "Grant Ingersoll" <[EMAIL PROTECTED]> said:
> Hmmm, not sure, but in looking at DocumentsWriter, it seems like the
> lines around 553 might be at issue:
>
>     if (tvx != null) {
>       tvx.writeLong(tvd.getFilePointer());
>       if (numVectorFields > 0) {
>         tvd.writeVInt(numVectorFields);
>         for(int i=0;i<numVectorFields;i++)
>           tvd.writeVInt(vectorFieldNumbers[i]);
>         assert 0 == vectorFieldPointers[0];
>         tvd.writeVLong(tvf.getFilePointer());
>         long lastPos = vectorFieldPointers[0];
>         for(int i=1;i<numVectorFields;i++) {
>           long pos = vectorFieldPointers[i];
>           tvd.writeVLong(pos-lastPos);
>           lastPos = pos;
>         }
>         tvfLocal.writeTo(tvf);
>         tvfLocal.reset();
>       }
>     }
>
> Specifically, the exception being thrown seems to be that the reader is
> trying to read a vInt that contains the number of fields that have
> vectors. However, DocumentsWriter only writes out this vInt if
> numVectorFields is > 0.
>
> I think you might try:
>
>     if (numVectorFields > 0) {
>       ....
>     } else {
>       tvd.writeVInt(0);
>     }
>
> In the old TermVectorsWriter, it used to be:
>
>     private void writeDoc() throws IOException {
>       if (isFieldOpen())
>         throw new IllegalStateException("Field is still open while writing document");
>       //System.out.println("Writing doc pointer: " + currentDocPointer);
>       // write document index record
>       tvx.writeLong(currentDocPointer);
>
>       // write document data record
>       final int size = fields.size();
>
>       // write the number of fields
>       tvd.writeVInt(size);
>
>       // write field numbers
>       for (int i = 0; i < size; i++) {
>         TVField field = (TVField) fields.elementAt(i);
>         tvd.writeVInt(field.number);
>       }
>
> http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup
>
> On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:
>
> > On Fri, 28 Sep 2007, Andi Vajda wrote:
> >
> >> I found a bug with indexing documents that contain fields with
> >> term vectors. The indexing fails with 'read past EOF' errors in
> >> what seems to be the index optimizing phase during addIndexes(). (I
> >> index first into a RAMDirectory, then addIndexes() into an
> >> FSDirectory.)
> >>
> >> I have not yet filed the bug formally, as I need to isolate the
> >> code. If I turn indexing with term vectors off, indexing completes
> >> fine.
> >
> > I tried all morning to isolate the problem but I seem to be unable
> > to reproduce it in a simple unit test. In my application, I've been
> > able to get errors by doing even less: just creating an FSDirectory
> > and adding documents with fields with term vectors fails when
> > optimizing the index, with the error below. I even tried to add the
> > same documents, in the same order, in the unit test, but to no
> > avail. It just works.
> >
> > What is different about my environment? Well, I'm running
> > PyLucene -- but the new one, the one using Apple's Java VM, the
> > same VM I'm using to run the unit test.
> > And I'm not doing anything special like calling back into Python
> > or something; I'm just calling regular Lucene APIs, adding documents
> > into an IndexWriter on an FSDirectory using a StandardAnalyzer. If I
> > stop using term vectors, all is working fine.
> >
> > I'd like to get to the bottom of this but could use some help. Does
> > the stack trace below ring a bell? Is there a way to run the whole
> > indexing and optimizing in one single thread?
> >
> > Thanks!
> >
> > Andi..
> >
> > Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
> >     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
> > Caused by: java.io.IOException: read past EOF
> >     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
> >     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> >     at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> >     at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
> >     at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
> >     at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
> >     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
> >     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
> >     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
> >     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
> >
> > java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
> >     at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
> >     at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
> > Caused by: java.io.IOException: read past EOF
> >     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
> >     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> >     at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> >     at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
> >     at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
> >     at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
> >     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
> >     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
> >     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
> >     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/