Re: How to handle corrupt Lucene index

Tim Whittington Wed, 13 Apr 2022 20:21:59 -0700

Yeah, I really appreciate the paranoia in the file format.

This is a distributed/replicated database (I'd forgotten to mention that
until you mentioned distributed replication), so I suspect the database
server is shunting actual segment files around during a recovery process
and getting things muddled up.
I actually captured one of the other nodes, and it seems to have a similar
problem, except it has 3 segments_ files (2 of which are identical to the
ones in the index I listed).
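
For spotting leftover commit points like these, a minimal Java sketch (standard library only, not the Lucene API) could list the segments_N files in a directory and order them by generation. The N suffix is a base-36 generation number. The class and method names here are made up for illustration. More than one segments_N file is not an error in itself, since a deletion policy can retain older commits, so this is only a cheap signal to log before opening an IndexWriter.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: lists segments_N commit files ordered by generation. */
final class CommitPointLister {

    private static final String PREFIX = "segments_";

    /** Parses the base-36 generation from a name like "segments_96"; -1 if not a commit file. */
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX) || fileName.length() == PREFIX.length()) {
            return -1;
        }
        try {
            long gen = Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
            return gen >= 0 ? gen : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Returns generation -> file name for every commit file found in indexDir. */
    static TreeMap<Long, String> listCommits(Path indexDir) throws IOException {
        TreeMap<Long, String> commits = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, PREFIX + "*")) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                long gen = generationOf(name);
                if (gen >= 0) {
                    commits.put(gen, name);
                }
            }
        }
        return commits;
    }

    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        TreeMap<Long, String> commits = listCommits(indexDir);
        for (Map.Entry<Long, String> e : commits.entrySet()) {
            System.out.println(e.getValue() + " (generation " + e.getKey() + ")");
        }
        if (commits.size() > 1) {
            System.out.println("note: " + commits.size() + " commit points present");
        }
    }
}
```

Run against the directory listing quoted further down, this would report segments_93 (generation 327) and segments_96 (generation 330).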


I'll continue to dig through the database server code to track down what's
causing this.

Thanks a lot for the quick help.
Tim

On Thu, 14 Apr 2022 at 15:00, Robert Muir <rcm...@gmail.com> wrote:

> Honestly the only time I've seen mixed-up files before (and the
> motivation for the paranoid checks in Lucene) was bugs in some
> distributed replication code. In this case code that was copying files
> across the network had some bugs (e.g. used hashing of file contents
> to try to reduce network chatter but didn't handle hash collisions
> properly). So it would most commonly happen to the .si file,
> simply because it is typically a tiny file and more likely to cause
> hash collisions in distributed code doing that. This was the
> motivation for adding a unique ID to each segment and to all files
> corresponding to that segment... basically, as a library, we can't
> trust filenames to be what they claim.
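
To make the replication pitfall above concrete, here is a rough Java sketch of a "skip the copy if the content already matches" check that compares file length plus a SHA-256 digest instead of a short hash. This is not code from the database in question or from Lucene, and FileFingerprint is a hypothetical name. It also does not truly handle collisions (only a byte-for-byte comparison would); it just makes an accidental one far less likely than with a weak hash.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical content fingerprint: file length plus SHA-256 digest. */
final class FileFingerprint {
    final long length;
    final byte[] sha256;

    private FileFingerprint(long length, byte[] sha256) {
        this.length = length;
        this.sha256 = sha256;
    }

    /** Streams the file once, counting bytes and feeding them to the digest. */
    static FileFingerprint of(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
        long length = 0;
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
                length += n;
            }
        }
        return new FileFingerprint(length, digest.digest());
    }

    /** A copy may be skipped only when both the length and the digest agree. */
    boolean matches(FileFingerprint other) {
        return length == other.length && MessageDigest.isEqual(sha256, other.sha256);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: FileFingerprint <fileA> <fileB>");
            return;
        }
        boolean same = of(Paths.get(args[0])).matches(of(Paths.get(args[1])));
        System.out.println(same ? "same content" : "different content");
    }
}
```

A replicator built this way would treat any mismatch as "copy the file again" rather than guessing.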
>
> segments_N doesn't just reference your segments by names like _8w and
> _94; it also records each segment's unique ID. I would have to look at
> its file format to tell you how to see this with your hex editor. But
> in general, the segment's unique ID is referenced everywhere, starting
> from segments_N. This way, when loading any index file for that
> segment (including *.si), Lucene checks that it has a matching ID so that
> we know it really does belong to that segment. Because we can't trust
> filenames when users may manipulate them :)
>
> If the file really belongs to another segment (e.g. because files got
> mixed up), this check produces a clear error saying so.
> Otherwise, without this check, you get pure insanity trying to debug
> problems when files get mixed up.
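
The idea of stamping every file with its owner's ID can be shown with a toy model in plain Java. To be clear about assumptions: this is not Lucene's on-disk format or API. ToyIdHeader, its 16-byte header layout, and its error message are invented for illustration. Each file starts with the ID of the segment it belongs to, and the reader refuses a file whose ID differs from the one the commit expects, whatever the file is named.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;
import java.util.Arrays;

/** Toy model only: a file layout of [16-byte owner ID][payload]. */
final class ToyIdHeader {
    static final int ID_LENGTH = 16;
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Generates a random ID for a new (hypothetical) segment. */
    static byte[] newId() {
        byte[] id = new byte[ID_LENGTH];
        RANDOM.nextBytes(id);
        return id;
    }

    /** Writes the owner ID followed by the payload. */
    static void write(Path file, byte[] ownerId, byte[] payload) throws IOException {
        byte[] all = new byte[ID_LENGTH + payload.length];
        System.arraycopy(ownerId, 0, all, 0, ID_LENGTH);
        System.arraycopy(payload, 0, all, ID_LENGTH, payload.length);
        Files.write(file, all);
    }

    /** Returns the payload, or throws if the file carries a different owner ID. */
    static byte[] readChecked(Path file, byte[] expectedId) throws IOException {
        byte[] all = Files.readAllBytes(file);
        if (all.length < ID_LENGTH) {
            throw new IOException("truncated header: " + file.getFileName());
        }
        byte[] actualId = Arrays.copyOfRange(all, 0, ID_LENGTH);
        if (!Arrays.equals(actualId, expectedId)) {
            throw new IOException("file mismatch: " + file.getFileName()
                    + " carries the ID of a different owner");
        }
        return Arrays.copyOfRange(all, ID_LENGTH, all.length);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("toy-index");
        byte[] idA = newId();
        byte[] idB = newId();
        // Simulate a mix-up: content owned by B ends up under A's file name.
        Path misplaced = dir.resolve("_a.si");
        write(misplaced, idB, new byte[] {42});
        try {
            readChecked(misplaced, idA);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the design is that the file name plays no part in the decision: a misplaced file fails fast at open time instead of being read as if it were valid data.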
>
> On Wed, Apr 13, 2022 at 10:39 PM Tim Whittington <t...@apache.org> wrote:
> >
> > Using a known-broken Lucene index directory, I dropped down to the Lucene
> > API and tracked this down a bit further.
> >
> > My directory listing is this:
> >
> > ----------------
> > 17 Mar 13:39 _8w.fdt
> > 17 Mar 13:39 _8w.fdx
> > 17 Mar 13:39 _8w.fnm
> > 17 Mar 13:39 _8w.nvd
> > 17 Mar 13:39 _8w.nvm
> > 17 Mar 13:39 _8w.si
> > 17 Mar 13:39 _8w_Lucene50_0.doc
> > 17 Mar 13:39 _8w_Lucene50_0.pos
> > 17 Mar 13:39 _8w_Lucene50_0.tim
> > 17 Mar 13:39 _8w_Lucene50_0.tip
> > 17 Mar 13:39 _8w_Lucene70_0.dvd
> > 17 Mar 13:39 _8w_Lucene70_0.dvm
> > 17 Mar 14:33 _8x.cfe
> > 17 Mar 14:33 _8x.cfs
> > 20 Mar 21:19 _8x.fdt
> > 20 Mar 21:19 _8x.fdx
> > 20 Mar 21:19 _8x.fnm
> > 20 Mar 21:19 _8x.nvd
> > 20 Mar 21:19 _8x.nvm
> > 20 Mar 21:19 _8x.si
> > 20 Mar 21:19 _8x_Lucene50_0.doc
> > 20 Mar 21:19 _8x_Lucene50_0.pos
> > 20 Mar 21:19 _8x_Lucene50_0.tim
> > 20 Mar 21:19 _8x_Lucene50_0.tip
> > 20 Mar 21:19 _8x_Lucene70_0.dvd
> > 20 Mar 21:19 _8x_Lucene70_0.dvm
> > 20 Mar 21:19 _8y.cfe
> > 20 Mar 21:19 _8y.cfs
> > 20 Mar 21:19 _8y.si
> > 20 Mar 21:19 _8z.cfe
> > 20 Mar 21:19 _8z.cfs
> > 20 Mar 21:19 _8z.si
> > 20 Mar 21:19 _90.cfe
> > 20 Mar 21:19 _90.cfs
> > 20 Mar 21:19 _90.si
> > 20 Mar 21:19 _91.cfe
> > 20 Mar 21:19 _91.cfs
> > 20 Mar 21:19 _91.si
> > 20 Mar 21:19 _92.cfe
> > 20 Mar 21:19 _92.cfs
> > 20 Mar 21:19 _92.si
> > 20 Mar 21:19 _93.cfe
> > 20 Mar 21:19 _93.cfs
> > 20 Mar 21:19 _93.si
> > 20 Mar 21:19 _94.cfe
> > 20 Mar 21:19 _94.cfs
> > 20 Mar 21:19 _94.si
> > 20 Mar 21:19 _95.cfe
> > 20 Mar 21:19 _95.cfs
> > 20 Mar 21:19 _95.si
> > 18 Mar 06:49 segments_93
> > 20 Mar 21:19 segments_96
> > 6 Mar 21:22 write.lock
> >
> > ----------------
> >
> > When I load SegmentInfos for segments_96 directly, it succeeds, and I can
> > see it's referencing all the SegmentInfo except for _8w.
> > If I try to load SegmentInfos for segments_93, it gets past loading _8w
> > and fails on _8x.
> > Checking with a hex editor, segments_93 is referencing _8w ... _94 and
> > segments_96 is referencing _8x ... _95
> >
> > The IndexWriter failure is due to the IndexFileDeleter attempting to load
> > segments_93 to track referenced commit infos.
> >
> > Is this a state an IndexWriter could get the directory into, or does it
> > involve higher level interference (like copying files around)?
> >
> > Tim
> >
> > On Thu, 14 Apr 2022 at 13:20, Baris Kazar <baris.ka...@oracle.com> wrote:
> >
> > > Yes, that is a great point to look at first, and it would eliminate any
> > > JDBC-related issues that may lead to such problems.
> > > Best regards
> > > ________________________________
> > > From: Tim Whittington <t...@apache.org>
> > > Sent: Wednesday, April 13, 2022 9:17:44 PM
> > > To: java-user@lucene.apache.org <java-user@lucene.apache.org>
> > > Subject: Re: How to handle corrupt Lucene index
> > >
> > > Thanks for this - I'll have a look at the database server code that is
> > > managing the Lucene indexes and see if I can track it down.
> > >
> > > Tim
> > >
> > > On Thu, 14 Apr 2022 at 12:41, Robert Muir <rcm...@gmail.com> wrote:
> > >
> > > > On Wed, Apr 13, 2022 at 8:24 PM Tim Whittington
> > > > <t...@whittington.nz.invalid> wrote:
> > > > >
> > > > > > I'm working with/on a database system that uses Lucene for full text
> > > > > > indexes (currently using 7.3.0).
> > > > > > We're encountering occasional problems that occur after unclean shutdowns
> > > > > > of the database, resulting in
> > > > > > "org.apache.lucene.index.CorruptIndexException: file mismatch" errors when
> > > > > > the IndexWriter is constructed.
> > > > > >
> > > > > > In all of the cases this has occurred, CheckIndex finds no issues with the
> > > > > > Lucene index.
> > > > > >
> > > > > > The database has write-ahead-log and recovery facilities, so making the
> > > > > > Lucene indexes durable wrt database operations is doable, but in this case
> > > > > > the IndexWriter itself is failing to initialise, so it looks like there
> > > > > > needs to be a lower-level validation/recovery operation before reconciling
> > > > > > transactions can take place.
> > > > > >
> > > > > > Can anyone provide any advice about how the database can detect and recover
> > > > > > from this situation?
> > > > >
> > > >
> > > > File mismatch means files are getting mixed up. It is the equivalent
> > > > of swapping say, /etc/hosts and /etc/passwd on your computer.
> > > >
> > > > > In your case you have a .si file (let's say it is named _79.si) that
> > > > > really belongs to another segment (e.g. _42).
> > > > >
> > > > > This isn't a Lucene issue; something else you are using must be
> > > > > "transporting files around", and it is mixing the files up.
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >
> > >
>
