There are a couple of JIRA issues related to term vectors (TVs) as
well, mostly edge cases, but Andi might want to take a look at them to
see if they describe his situation.
-Grant
On Sep 29, 2007, at 8:35 AM, Michael McCandless wrote:
You are right Grant -- good catch!!! I have a unit test showing it
now. Thank you :)
So, this case is tickled if you have a doc (or docs) that have some
fields with term vectors enabled, but then later as part of the same
buffered set of docs you have 1 or more docs that have no fields with
term vectors enabled.
I'll fix it.
The thing is, from Andi's description I'm not sure this is the case
he's hitting? He said all docs have 5 fields, one of them with term
vectors enabled ... hmmm.
Mike
On Sat, 29 Sep 2007 07:59:13 -0400, "Grant Ingersoll"
<[EMAIL PROTECTED]> said:
Hmmm, not sure, but in looking at DocumentsWriter, it seems like
lines around 553 might be at issue:
    if (tvx != null) {
      tvx.writeLong(tvd.getFilePointer());
      if (numVectorFields > 0) {
        tvd.writeVInt(numVectorFields);
        for(int i=0;i<numVectorFields;i++)
          tvd.writeVInt(vectorFieldNumbers[i]);
        assert 0 == vectorFieldPointers[0];
        tvd.writeVLong(tvf.getFilePointer());
        long lastPos = vectorFieldPointers[0];
        for(int i=1;i<numVectorFields;i++) {
          long pos = vectorFieldPointers[i];
          tvd.writeVLong(pos-lastPos);
          lastPos = pos;
        }
        tvfLocal.writeTo(tvf);
        tvfLocal.reset();
      }
    }
Specifically, the exception seems to be thrown while the reader is
trying to read a vInt that holds the number of fields that have
vectors. However, DocumentsWriter only writes out this vInt if
numVectorFields is > 0.
I think you might try:
    if (numVectorFields > 0) {
      ....
    }
    else {
      tvd.writeVInt(0);
    }
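
To make the failure mode above concrete, here is a minimal, self-contained sketch. It does not use Lucene at all: the class TvdCountSketch, its methods, and the single in-memory "tvd" stream are hypothetical stand-ins, and the reader walks the records sequentially instead of seeking through tvx pointers as the real TermVectorsReader does. It only illustrates why skipping the field-count vInt for documents without vectors can make a reader run off the end of the data, and why always writing the count (0 included) avoids that.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TvdCountSketch {

    // Variable-length int: 7 bits per byte, high bit set means more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte(); // throws EOFException when no data is left
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // One record per document: a field count followed by that many field numbers.
    // With alwaysWriteCount == false, documents without vector fields write nothing
    // (an assumption modelling the behaviour discussed above).
    static byte[] writeDocs(int[][] fieldNumbersPerDoc, boolean alwaysWriteCount) {
        ByteArrayOutputStream tvd = new ByteArrayOutputStream();
        for (int[] fieldNumbers : fieldNumbersPerDoc) {
            if (fieldNumbers.length > 0) {
                writeVInt(tvd, fieldNumbers.length);
                for (int n : fieldNumbers) {
                    writeVInt(tvd, n);
                }
            } else if (alwaysWriteCount) {
                writeVInt(tvd, 0);
            }
        }
        return tvd.toByteArray();
    }

    // Reads docCount records and returns the total number of field entries,
    // or -1 if the data ends before every document has been read.
    static int readDocs(byte[] data, int docCount) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int total = 0;
        try {
            for (int d = 0; d < docCount; d++) {
                int numFields = readVInt(in);
                for (int i = 0; i < numFields; i++) {
                    readVInt(in);
                }
                total += numFields;
            }
        } catch (EOFException e) {
            return -1; // the sketch's analogue of "read past EOF"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // One document with a vector field, followed by two documents without any.
        int[][] docs = { {3}, {}, {} };
        System.out.println(readDocs(writeDocs(docs, false), docs.length)); // prints -1
        System.out.println(readDocs(writeDocs(docs, true), docs.length));  // prints 1
    }
}
```

The documents without vectors are placed last on purpose: with a sequential reader, an empty document in the middle would misalign the following records rather than hit the end of the data.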
In the old TermVectorsWriter, it used to be:
    private void writeDoc() throws IOException {
      if (isFieldOpen())
        throw new IllegalStateException("Field is still open while writing document");
      //System.out.println("Writing doc pointer: " + currentDocPointer);
      // write document index record
      tvx.writeLong(currentDocPointer);
      // write document data record
      final int size = fields.size();
      // write the number of fields
      tvd.writeVInt(size);
      // write field numbers
      for (int i = 0; i < size; i++) {
        TVField field = (TVField) fields.elementAt(i);
        tvd.writeVInt(field.number);
      }
http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup
On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:
On Fri, 28 Sep 2007, Andi Vajda wrote:
I found a bug with indexing documents that contain fields with
term vectors. The indexing fails with 'read past EOF' errors in
what seems to be the index-optimizing phase during addIndexes(). (I
index first into a RAMDirectory, then addIndexes() into an
FSDirectory.)
I have not filed the bug yet formally as I need to isolate the
code. If I turn indexing with term vectors off, indexing completes
fine.
I tried all morning to isolate the problem but I seem to be unable
to reproduce it in a simple unit test. In my application, I've been
able to get errors by doing even less: just creating a FSDirectory
and adding documents with fields with term vectors fails when
optimizing the index with the error below. I even tried to add the
same documents, in the same order, in the unit test but to no
avail. It just works.
What is different about my environment? Well, I'm running
PyLucene, but the new one, the one using Apple's Java VM, the
same VM I'm using to run the unit test. And I'm not doing anything
special like calling back into Python or something, I'm just
calling regular Lucene APIs, adding documents into an IndexWriter on
an FSDirectory using a StandardAnalyzer. If I stop using term
vectors, everything works fine.
I'd like to get to the bottom of this but could use some help. Does
the stacktrace below ring a bell ? Is there a way to run the whole
indexing and optimizing in one single thread ?
Thanks !
Andi..
Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/