Honestly, the only time I've seen mixed-up files before (and the
motivation for the paranoid checks in Lucene) was bugs in some
distributed replication code. In that case, the code copying files
across the network had bugs (e.g. it used hashing of file contents
to try to reduce network chatter but didn't handle hash collisions
properly). So it would most commonly happen for the .si file,
simply because it is typically a tiny file and more likely to cause
hash collisions in distributed code doing that. This was the
motivation for adding a unique ID to each segment and to all files
corresponding to that segment... basically, as a library, we can't
trust filenames to be what they claim.

segments_N doesn't just reference your segments by names like _8w and
_94; it also has each segment's unique ID. I would have to look at
its file format to tell you how to see this with your hex editor. But
in general, the segment's unique ID is referenced everywhere, starting
from segments_N. This way, when loading any index file for that
segment (including *.si), Lucene checks that it has a matching ID, so
that we know it really does belong to that segment. Because we can't
trust filenames when users may manipulate them :)

If the file really belongs to another segment (e.g. because files got
mixed up), there's a clear error this way saying that files are mixed
up. Otherwise, without this check, you get pure insanity trying to
debug problems when files get mixed up.
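
To make the idea concrete, here is a minimal, self-contained sketch of
that kind of check. It is not Lucene's code or file format: the toy
layout (a 16-byte ID followed by a payload), the class name
SegmentIdCheck and all of its methods are made up for illustration. It
only shows the principle that the expected ID comes from the commit
point, and each file must carry the same ID regardless of its name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not Lucene's real header format or API.
class SegmentIdCheck {
    static final int ID_LENGTH = 16;

    // Builds a toy file: [16-byte segment ID][payload bytes].
    static byte[] writeFile(byte[] segmentId, String payload) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[ID_LENGTH + body.length];
        System.arraycopy(segmentId, 0, out, 0, ID_LENGTH);
        System.arraycopy(body, 0, out, ID_LENGTH, body.length);
        return out;
    }

    // Fails loudly if the ID stored in the file is not the ID the
    // commit point recorded for this segment. The file name is used
    // only for the error message, never as proof of ownership.
    static void checkFile(String fileName, byte[] contents, byte[] expectedId) {
        if (contents.length < ID_LENGTH) {
            throw new IllegalStateException("truncated header: " + fileName);
        }
        byte[] actual = Arrays.copyOfRange(contents, 0, ID_LENGTH);
        if (!Arrays.equals(actual, expectedId)) {
            throw new IllegalStateException("file mismatch: " + fileName
                + " expected id=" + toHex(expectedId)
                + " got=" + toHex(actual));
        }
    }

    // Deterministic stand-in for a random segment ID.
    static byte[] id(int seed) {
        byte[] b = new byte[ID_LENGTH];
        Arrays.fill(b, (byte) seed);
        return b;
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] idOf8x = id(1);   // ID the commit point recorded for _8x
        byte[] idOf8w = id(2);   // ID of some other segment

        // A file that really belongs to _8x passes the check.
        checkFile("_8x.si", writeFile(idOf8x, "segment info"), idOf8x);

        // A file named _8x.si whose contents came from _8w is rejected.
        try {
            checkFile("_8x.si", writeFile(idOf8w, "segment info"), idOf8x);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the design is that a mixed-up file fails immediately with
a message naming the file, instead of being parsed as if it belonged
to the segment.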

On Wed, Apr 13, 2022 at 10:39 PM Tim Whittington <t...@apache.org> wrote:
>
> Using a known-broken Lucene index directory, I dropped down to the Lucene
> API and tracked this down a bit further.
>
> My directory listing is this:
>
> ----------------
> 17 Mar 13:39 _8w.fdt
> 17 Mar 13:39 _8w.fdx
> 17 Mar 13:39 _8w.fnm
> 17 Mar 13:39 _8w.nvd
> 17 Mar 13:39 _8w.nvm
> 17 Mar 13:39 _8w.si
> 17 Mar 13:39 _8w_Lucene50_0.doc
> 17 Mar 13:39 _8w_Lucene50_0.pos
> 17 Mar 13:39 _8w_Lucene50_0.tim
> 17 Mar 13:39 _8w_Lucene50_0.tip
> 17 Mar 13:39 _8w_Lucene70_0.dvd
> 17 Mar 13:39 _8w_Lucene70_0.dvm
> 17 Mar 14:33 _8x.cfe
> 17 Mar 14:33 _8x.cfs
> 20 Mar 21:19 _8x.fdt
> 20 Mar 21:19 _8x.fdx
> 20 Mar 21:19 _8x.fnm
> 20 Mar 21:19 _8x.nvd
> 20 Mar 21:19 _8x.nvm
> 20 Mar 21:19 _8x.si
> 20 Mar 21:19 _8x_Lucene50_0.doc
> 20 Mar 21:19 _8x_Lucene50_0.pos
> 20 Mar 21:19 _8x_Lucene50_0.tim
> 20 Mar 21:19 _8x_Lucene50_0.tip
> 20 Mar 21:19 _8x_Lucene70_0.dvd
> 20 Mar 21:19 _8x_Lucene70_0.dvm
> 20 Mar 21:19 _8y.cfe
> 20 Mar 21:19 _8y.cfs
> 20 Mar 21:19 _8y.si
> 20 Mar 21:19 _8z.cfe
> 20 Mar 21:19 _8z.cfs
> 20 Mar 21:19 _8z.si
> 20 Mar 21:19 _90.cfe
> 20 Mar 21:19 _90.cfs
> 20 Mar 21:19 _90.si
> 20 Mar 21:19 _91.cfe
> 20 Mar 21:19 _91.cfs
> 20 Mar 21:19 _91.si
> 20 Mar 21:19 _92.cfe
> 20 Mar 21:19 _92.cfs
> 20 Mar 21:19 _92.si
> 20 Mar 21:19 _93.cfe
> 20 Mar 21:19 _93.cfs
> 20 Mar 21:19 _93.si
> 20 Mar 21:19 _94.cfe
> 20 Mar 21:19 _94.cfs
> 20 Mar 21:19 _94.si
> 20 Mar 21:19 _95.cfe
> 20 Mar 21:19 _95.cfs
> 20 Mar 21:19 _95.si
> 18 Mar 06:49 segments_93
> 20 Mar 21:19 segments_96
> 6 Mar 21:22 write.lock
>
> ----------------
>
> When I load SegmentInfos for segments_96 directly, it succeeds, and I can
> see it's referencing all the SegmentInfo except for _8w.
> If I try to load SegmentInfos for segments_93, it gets past loading _8w and
> fails on _8x.
> Checking with a hex editor, segments_93 is referencing _8w ... _94 and
> segments_96 is referencing _8x ... _95
>
> The IndexWriter failure is due to the IndexFileDeleter attempting to load
> segments_93 to track referenced commit infos.
>
> Is this a state an IndexWriter could get the directory into, or does it
> involve higher level interference (like copying files around)?
>
> Tim
>
> On Thu, 14 Apr 2022 at 13:20, Baris Kazar <baris.ka...@oracle.com> wrote:
>
> > yes that is a great point to look at first and that would eliminate any
> > jdbc related issues that may lead to such problems.
> > Best regards
> > ________________________________
> > From: Tim Whittington <t...@apache.org>
> > Sent: Wednesday, April 13, 2022 9:17:44 PM
> > To: java-user@lucene.apache.org <java-user@lucene.apache.org>
> > Subject: Re: How to handle corrupt Lucene index
> >
> > Thanks for this - I'll have a look at the database server code that is
> > managing the Lucene indexes and see if I can track it down.
> >
> > Tim
> >
> > On Thu, 14 Apr 2022 at 12:41, Robert Muir <rcm...@gmail.com> wrote:
> >
> > > On Wed, Apr 13, 2022 at 8:24 PM Tim Whittington
> > > <t...@whittington.nz.invalid> wrote:
> > > >
> > > > I'm working with/on a database system that uses Lucene for full text
> > > > indexes (currently using 7.3.0).
> > > > We're encountering occasional problems that occur after unclean
> > > > shutdowns of the database, resulting in
> > > > "org.apache.lucene.index.CorruptIndexException: file mismatch" errors
> > > > when the IndexWriter is constructed.
> > > >
> > > > In all of the cases this has occurred, CheckIndex finds no issues with
> > > > the Lucene index.
> > > >
> > > > The database has write-ahead-log and recovery facilities, so making the
> > > > Lucene indexes durable wrt database operations is doable, but in this
> > > > case the IndexWriter itself is failing to initialise, so it looks like
> > > > there needs to be a lower-level validation/recovery operation before
> > > > reconciling transactions can take place.
> > > >
> > > > Can anyone provide any advice about how the database can detect and
> > > > recover from this situation?
> > > >
> > >
> > > File mismatch means files are getting mixed up. It is the equivalent
> > > of swapping say, /etc/hosts and /etc/passwd on your computer.
> > >
> > > In your case you have a .si file (lets say it is named _79.si) that
> > > really belongs to another segment (e.g. _42).
> > >
> > > This isn't a lucene issue, this is something else you must be using
> > > that is "transporting files around", and it is mixing the files up.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >

