"Andi Vajda" <[EMAIL PROTECTED]> wrote:
>
> I tried all morning to isolate the problem but I seem to be unable
> to reproduce it in a simple unit test. In my application, I've been
> able to get errors by doing even less: just creating a FSDirectory
> and adding documents with fields with term vectors fails when
> optimizing the index with the error below. I even tried to add the
> same documents, in the same order, in the unit test but to no
> avail. It just works.

Are you trying your unit test first in Python (using PyLucene)?

> What is different about my environment ? Well, I'm running PyLucene,
> but the new one, the one using Apple's Java VM, the same VM I'm
> using to run the unit test. And I'm not doing anything special like
> calling back into Python or something, I'm just calling regular
> Lucene APIs adding documents into an IndexWriter on an FSDirectory
> using a StandardAnalyzer. If I stop using term vectors, all is
> working fine.

Spooky.  It's definitely possible something is broken (there is a lot
of new code in 2.3).

Are your documents irregular wrt term vectors?  (I.e., some docs have
none, others store the terms but not positions/offsets, etc.?)  Any
interesting changes to Lucene's defaults (autoCommit=false, etc)?
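By "irregular" I mean something like this fragment (field names and
text here are just illustrative, not from your app; this uses the
2.3-era Field API):

```java
// Two docs whose "body" field uses different term vector settings:
// d1 stores no term vectors, d2 stores terms plus positions/offsets.
Document d1 = new Document();
d1.add(new Field("body", "some text", Field.Store.NO,
                 Field.Index.TOKENIZED, Field.TermVector.NO));

Document d2 = new Document();
d2.add(new Field("body", "other text", Field.Store.NO,
                 Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));

writer.addDocument(d1);
writer.addDocument(d2);
```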

> I'd like to get to the bottom of this but could use some help. Does
> the stacktrace below ring a bell ? Is there a way to run the whole
> indexing and optimizing in one single thread ?

You can easily turn off the concurrent (background) merges by doing
this:

  writer.setMergeScheduler(new SerialMergeScheduler());

though that probably isn't punched through to Python in PyLucene.  You
can also build a Lucene JAR w/ a small change to IndexWriter.java to
do the same thing.
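In context it would look something like this (the writer construction
is illustrative; substitute your own directory/analyzer):

```java
// Force all merges to run synchronously in the calling thread, so the
// whole index + optimize sequence is effectively single-threaded.
IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
writer.setMergeScheduler(new SerialMergeScheduler());
// ... addDocument calls ...
writer.optimize();
writer.close();
```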

That stacktrace is happening while merging term vectors during an
optimize.  It's specifically occurring when loading the term vectors
for a given doc X; we read a position from the index stream (tvx) just
fine, but then when we try to read the first vInt from the document
stream (tvd) we hit the EOF exception.  So that position was too large
or the tvd file was somehow truncated.  Weird.
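For what it's worth, here's a self-contained sketch of the vInt
decoding involved (7 bits per byte, low-order bytes first, high bit as
the continuation flag, per the file formats doc); running out of bytes
mid-read is exactly the "read past EOF" case in your stack trace:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;

public class VIntDemo {
    // Decode one vInt from the stream; throws EOFException if the
    // stream ends before the value is complete (a truncated tvd file,
    // or a bogus position from tvx, would land here).
    static int readVInt(ByteArrayInputStream in) throws IOException {
        int b = in.read();
        if (b == -1) throw new EOFException("read past EOF");
        int value = b & 0x7F;
        int shift = 7;
        while ((b & 0x80) != 0) {
            b = in.read();
            if (b == -1) throw new EOFException("read past EOF");
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        // 300 encodes as bytes 0xAC, 0x02.
        ByteArrayInputStream in =
            new ByteArrayInputStream(new byte[]{(byte) 0xAC, 0x02});
        System.out.println(readVInt(in)); // prints 300
    }
}
```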

Can you call "writer.setInfoStream(System.out)" and get the error to
occur and then post the resulting log?  It may shed some light
here....

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]