The only change I made to the source code was the patch for PayloadNearQuery (LUCENE-1986). It's possible that our content contains U+FFFF. I will run in debugger and see. The data is 'sensitive', so I may not be able to provide a bad segment, unfortunately.
Peter On Wed, Oct 28, 2009 at 10:43 AM, Michael McCandless < [email protected]> wrote: > OK... when you exported the sources & built yourself, you didn't make > any changes, right? > > It's really odd how many of the errors are due to the term > "literals:cfid196$", or some variation (one time with "on" appended, > another time with "microsoft"). Do you know what documents typically > contain that term, and what the context is around it? Maybe try to > index only those documents and see if this happens? (It could > conceivably be caused by bad data, if this is some weird bug). One > question: does your content ever use the [invalid] unicode character > U+FFFF? (Lucene uses this internally to mark the end of the term). > > Would it be possible to zip up all files starting with _1c (should be > ~22 MB) and post somewhere that I could download? That's the smallest > of the broken segments I think. > > I don't need the full IW output just yet, thanks. > > Mike > > On Wed, Oct 28, 2009 at 10:21 AM, Peter Keegan <[email protected]> > wrote: > > Yes, I used JDK 1.6.0_16 when running CheckIndex and it reported the same > > problems when run multiple times. > > > >>Also, what does Lucene version "2.9 exported - 2009-10-27 15:31:52" mean? > > This appears to be something added by the ant build, since I built Lucene > > from the source code. > > > > I rebuilt the index using mergeFactor=50, ramBufferSize=200MB, > > maxBufferedDocs=1000000 > > This produced 49 segments, 9 of which are broken. The broken segments are > in > > the latter half, similar to my previous post with 3 segments. Do you > think > > this could be caused by 'bad' data, for example bad unicode characters? > > > > Here is the output from CheckIndex: > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > >
