Hi Michael,
On 25/03/10 18:45, Michael McCandless wrote:
> Hi Renaud,
> It's great that you're pushing flex forward so much :) You're making
> some cool sounding codecs! I'm really looking forward to seeing
> indexing/searching performance results on Wikipedia...
I'll share them for sure whenever the results are ready ;o).
> It sounds most likely there's a bug in the PFor impl? (Since you don't
> hit this exception with the others...)
It seems so, but I also find it strange that I cannot reproduce it with
synthetic data.
> During merge, each segment's docIDs are rebased according to how many
> non-deleted docs there are in all prior segments. One possibility
> here is that a given segment thought it had N deletions but in fact
> encountered fewer than N while iterating its docs. This would cause
> the next segment to have too low a base, which can cause this exact
> exception on crossing from one segment to the next. (Ie, the very
> first doc of the next segment will suddenly be <= prior doc(s).)
> But... if that's happening (ie, the bug is in Lucene, not in the PFor
> impl), you'd expect the other codecs to hit it too.
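To make that rebasing scenario concrete, here is a small self-contained sketch (not Lucene's actual merge code; the class and method names are hypothetical) of how per-segment bases accumulate, and how a segment that overstates its deletion count gives the next segment a too-low base:

```java
// Hypothetical sketch, not Lucene's merge code: a segment's docID base
// is the count of non-deleted docs in all prior segments.
public class DocBaseSketch {
    static int[] computeBases(int[] maxDocs, int[] believedDeletions) {
        int[] bases = new int[maxDocs.length];
        int base = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            bases[i] = base;
            base += maxDocs[i] - believedDeletions[i];
        }
        return bases;
    }

    public static void main(String[] args) {
        int[] maxDocs = {100, 100};
        // Correct case: segment 0 really has no deletions, so segment 1's
        // base is 100 and its first docID (100) is > segment 0's last (99).
        System.out.println(computeBases(maxDocs, new int[] {0, 0})[1]); // 100
        // Buggy case: segment 0 "thought" it had 5 deletions but actually
        // iterated all 100 docs; segment 1's base drops to 95, so its first
        // docID (95) is <= docIDs already written for segment 0.
        System.out.println(computeBases(maxDocs, new int[] {5, 0})[1]); // 95
    }
}
```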
> Are you using multiple threads for indexing? Are you also mixing in
> deletions (or updateDocument calls)?
There are no deletions; I just create the index from scratch, and each
document I add has a unique identifier.
I am using a single thread for indexing: reading the list of Wikipedia
articles sequentially, putting the content into a single field, and
adding the document to the index. A commit is done every 10K documents.
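For reference, the loop is essentially the following. This is a minimal runnable sketch of the batching logic only; `FakeWriter` is a hypothetical stand-in for the real IndexWriter (which would provide addDocument and commit) so the example is self-contained:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Lucene's IndexWriter, only so this sketch
// runs on its own; the real loop calls IndexWriter.addDocument / commit.
class FakeWriter {
    int added = 0;
    int commits = 0;
    void addDocument(String content) { added++; } // one field per article
    void commit() { commits++; }
}

public class IndexingLoopSketch {
    // Single thread, sequential read, commit every 10K documents.
    static FakeWriter index(List<String> articles) {
        FakeWriter writer = new FakeWriter();
        for (String article : articles) {
            writer.addDocument(article);
            if (writer.added % 10000 == 0) {
                writer.commit();
            }
        }
        writer.commit(); // flush the remaining tail
        return writer;
    }

    public static void main(String[] args) {
        FakeWriter w = IndexingLoopSketch.index(
            Collections.nCopies(25000, "article text"));
        // 25000 docs -> commits at 10K and 20K, plus the final one = 3
        System.out.println(w.added + " docs, " + w.commits + " commits");
    }
}
```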
I have tried different mergeFactors (2 and 20), but whenever the first
merge occurs, I get this CorruptIndexException.
I will continue to debug, but if I could at least have the faulty
segment and the faulty term (or, even better, the index of the faulty
block), I would be able to display the content of the blocks and see
whether there are problems in the PFor encoding.
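Once the block contents are visible, one simple sanity check would be over the decoded deltas themselves. This is a hedged sketch with hypothetical names: since docIDs within a segment are strictly increasing, every gap between consecutive docIDs must be >= 1, and a smaller delta is exactly the "doc went backwards" condition the exception reports:

```java
// Hypothetical sketch: validate a decoded block of docID gaps (the
// deltas between consecutive docIDs, not the block's first absolute
// docID). Each gap must be >= 1 for docIDs to be strictly increasing.
public class BlockCheckSketch {
    // Returns -1 if all gaps are valid, else the offset of the first
    // gap that would produce a docID <= the previous one.
    static int firstInvalidDelta(int[] deltas) {
        for (int i = 0; i < deltas.length; i++) {
            if (deltas[i] < 1) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidDelta(new int[] {1, 3, 2})); // -1: valid
        System.out.println(firstInvalidDelta(new int[] {1, 0, 2})); // 1: docID repeats
    }
}
```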
Cheers,
--
Renaud Delbru
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]