On Thu, Mar 25, 2010 at 3:04 PM, Renaud Delbru <[email protected]> wrote:
> Hi Michael,
>
> On 25/03/10 18:45, Michael McCandless wrote:
>>
>> Hi Renaud,
>>
>> It's great that you're pushing flex forward so much :)  You're making
>> some cool-sounding codecs!  I'm really looking forward to seeing
>> indexing/searching performance results on Wikipedia...
>>
>
> I'll share them for sure whenever the results are ready ;o).
I'll be waiting eagerly :)

>> It sounds most likely there's a bug in the PFor impl?  (Since you don't
>> hit this exception with the others...)
>>
>
> It seems so, but I also found it strange that I cannot reproduce it with
> synthetic data.

Hmmm.

>> During merge, each segment's docIDs are rebased according to how many
>> non-deleted docs there are in all prior segments.  One possibility
>> here is that a given segment thought it had N deletions but in fact
>> encountered fewer than N while iterating its docs.  This would cause
>> the next segment to have too low a base, which can cause this exact
>> exception on crossing from one segment to the next.  (Ie, the very
>> first doc of the next segment will suddenly be <= prior doc(s).)
>>
>> But... if that's happening (ie, the bug is in Lucene, not in the PFor
>> impl), you'd expect the other codecs to hit it too.
>>
>> Are you using multiple threads for indexing?  Are you also mixing in
>> deletions (or updateDocument calls)?
>>
>
> There is no deletion; I just create the index from scratch, and each
> document I am adding has a unique identifier.

Hmmm.

> I am using a single thread for indexing: reading the list of wikipedia
> articles sequentially, putting the content into a single field, and
> adding the document to the index.  Commit is done every 10K documents.

Are you using contrib/benchmark for this?  That makes it very easy to
run tests like this... hmm, though we need to extend it so you can
specify which Codec to use...

> I have tried with different mergeFactors (2, or 20), but whenever the
> first merge occurs, I get this CorruptIndexException.

It's that consistent?  Is it always that the docID is == to one prior?
Or is the next docID sometimes < the prior one?  And is it always on
the 1st docID of a new segment?
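[Editor's note: a minimal sketch of the docID-rebasing failure mode Mike describes above. This is not Lucene's actual merge code; the class and method names (`DocBaseSketch`, `computeBases`) are made up for illustration. It only shows the arithmetic: if a segment over-reports its deletion count, every later segment's base is too low, and the first remapped docID of the next segment can come out <= the last docID already written.]

```java
// Hypothetical sketch (NOT Lucene's real merge code): each segment's
// docBase is the sum of live (non-deleted) docs in all prior segments.
public class DocBaseSketch {

    // maxDoc[i] = number of docs in segment i;
    // delCount[i] = deletions segment i *claims* to have.
    static int[] computeBases(int[] maxDoc, int[] delCount) {
        int[] base = new int[maxDoc.length];
        int liveSoFar = 0;
        for (int i = 0; i < maxDoc.length; i++) {
            base[i] = liveSoFar;
            liveSoFar += maxDoc[i] - delCount[i];
        }
        return base;
    }

    public static void main(String[] args) {
        int[] maxDoc = {5, 5};
        // Correct case: no deletions, segment 1 starts at base 5.
        int[] correct = computeBases(maxDoc, new int[] {0, 0});
        // Buggy case: segment 0 claims 2 deletions it never actually
        // hits while iterating, so segment 1's base drops to 3 and its
        // first remapped docID (3) is <= segment 0's last docID (4) --
        // the "docs out of order" CorruptIndexException symptom.
        int[] buggy = computeBases(maxDoc, new int[] {2, 0});
        System.out.println(correct[1] + " " + buggy[1]);
    }
}
```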
> I will try to continue to debug, but if I could have at least the faulty
> segment and the faulty term (or even better, the index of the faulty
> block), I would be able to display the content of the blocks and see if
> there are any problems in the PFor encoding.

You can instrument the code (or catch the exc in a debugger) to see
all these details?

Or... if you can post a patch of where you are, I can dig, if I can
repro the issue...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
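[Editor's note: to make the suspected PFor bug concrete, here is a toy round-trip of the PFor idea under discussion. It is illustrative only and does not follow the block layout of any real codec; the class name `PForSketch` is invented. Delta-gaps are packed at a fixed bit width, values that don't fit become "exceptions" recorded as (position, value) pairs and patched back at decode time. If the patch positions or values are off by even one slot, a decoded gap comes out too small and the rebuilt docID stream stops increasing, which matches the reported symptom.]

```java
import java.util.ArrayList;
import java.util.List;

// Toy PFor-style encoder/decoder: values <= (2^bits - 1) are stored
// inline; larger ones become patched exceptions.
public class PForSketch {
    final int bits;
    final int[] packed;            // inline low-bit values (0 = placeholder)
    final List<int[]> exceptions;  // each entry: {position, full value}

    PForSketch(int[] deltas, int bits) {
        this.bits = bits;
        this.packed = new int[deltas.length];
        this.exceptions = new ArrayList<>();
        int max = (1 << bits) - 1;
        for (int i = 0; i < deltas.length; i++) {
            if (deltas[i] > max) {
                packed[i] = 0;                        // placeholder slot
                exceptions.add(new int[] {i, deltas[i]});
            } else {
                packed[i] = deltas[i];
            }
        }
    }

    // Decode patches each exception back into its recorded position.
    // A bug that patched position e[0] + 1 instead would leave a 0 gap
    // behind -- and a 0 delta-gap means docID[i] == docID[i - 1].
    int[] decode() {
        int[] out = packed.clone();
        for (int[] e : exceptions) {
            out[e[0]] = e[1];
        }
        return out;
    }
}
```

Dumping the decoded block next to the original deltas at the point of the exception, as Renaud proposes, is exactly how such an off-by-one would show up.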
