Yeah, I really appreciate the paranoia in the file format. This is a distributed/replicated database (I'd forgotten to mention that until you mentioned distributed replication), so I suspect the database server is shunting actual segment files around during a recovery process and getting things muddled up. I actually captured one of the other nodes, and it seems to have a similar problem, except it has 3 segments_ files (2 of which are identical to the ones in the index I listed).
I'll continue to dig through the database server code to track down what's causing this.
Thanks a lot for the quick help.

Tim

On Thu, 14 Apr 2022 at 15:00, Robert Muir <rcm...@gmail.com> wrote:

> Honestly the only time i've seen the mixed up files before (and the
> motivation for the paranoid checks in lucene), was bugs in some
> distributed replication code. In this case code that was copying files
> across the network had some bugs (e.g. used hashing of file contents
> to try to reduce network chatter but didn't handle hash collisions
> properly). So it would actually most commonly happen for .si file
> simply because it is typically a tiny file and more likely to cause
> hash collisions in some distributed code doing that. This was the
> motivation for adding unique id to each segment and all files
> corresponding to that segment... basically as a library, we can't
> trust filenames to be what they claim.
>
> segments_N doesn't just reference your segments by names like _8w and
> _94 but it also has segment's unique IDs, too. Would have to look at
> its file format to tell you how to see this with your hex editor. But
> in general, the segment unique ID is referenced everywhere, starting
> from segments_N. This way, when loading any index files for that
> segment (including *.si), lucene checks they have matching ID so that
> we know they really do belong to that segment. Because we can't trust
> filenames when users may manipulate them :)
>
> If the file really belongs to another segment (e.g. because files got
> mixed up), there's a clear error this way that files are mixed up.
> otherwise, without this check, you get pure insanity trying to debug
> problems when files get mixed up.
>
> On Wed, Apr 13, 2022 at 10:39 PM Tim Whittington <t...@apache.org> wrote:
> >
> > Using a known-broken Lucene index directory, I dropped down to the Lucene
> > API and tracked this down a bit further.
> >
> > My directory listing is this:
> >
> > ----------------
> > 17 Mar 13:39 _8w.fdt
> > 17 Mar 13:39 _8w.fdx
> > 17 Mar 13:39 _8w.fnm
> > 17 Mar 13:39 _8w.nvd
> > 17 Mar 13:39 _8w.nvm
> > 17 Mar 13:39 _8w.si
> > 17 Mar 13:39 _8w_Lucene50_0.doc
> > 17 Mar 13:39 _8w_Lucene50_0.pos
> > 17 Mar 13:39 _8w_Lucene50_0.tim
> > 17 Mar 13:39 _8w_Lucene50_0.tip
> > 17 Mar 13:39 _8w_Lucene70_0.dvd
> > 17 Mar 13:39 _8w_Lucene70_0.dvm
> > 17 Mar 14:33 _8x.cfe
> > 17 Mar 14:33 _8x.cfs
> > 20 Mar 21:19 _8x.fdt
> > 20 Mar 21:19 _8x.fdx
> > 20 Mar 21:19 _8x.fnm
> > 20 Mar 21:19 _8x.nvd
> > 20 Mar 21:19 _8x.nvm
> > 20 Mar 21:19 _8x.si
> > 20 Mar 21:19 _8x_Lucene50_0.doc
> > 20 Mar 21:19 _8x_Lucene50_0.pos
> > 20 Mar 21:19 _8x_Lucene50_0.tim
> > 20 Mar 21:19 _8x_Lucene50_0.tip
> > 20 Mar 21:19 _8x_Lucene70_0.dvd
> > 20 Mar 21:19 _8x_Lucene70_0.dvm
> > 20 Mar 21:19 _8y.cfe
> > 20 Mar 21:19 _8y.cfs
> > 20 Mar 21:19 _8y.si
> > 20 Mar 21:19 _8z.cfe
> > 20 Mar 21:19 _8z.cfs
> > 20 Mar 21:19 _8z.si
> > 20 Mar 21:19 _90.cfe
> > 20 Mar 21:19 _90.cfs
> > 20 Mar 21:19 _90.si
> > 20 Mar 21:19 _91.cfe
> > 20 Mar 21:19 _91.cfs
> > 20 Mar 21:19 _91.si
> > 20 Mar 21:19 _92.cfe
> > 20 Mar 21:19 _92.cfs
> > 20 Mar 21:19 _92.si
> > 20 Mar 21:19 _93.cfe
> > 20 Mar 21:19 _93.cfs
> > 20 Mar 21:19 _93.si
> > 20 Mar 21:19 _94.cfe
> > 20 Mar 21:19 _94.cfs
> > 20 Mar 21:19 _94.si
> > 20 Mar 21:19 _95.cfe
> > 20 Mar 21:19 _95.cfs
> > 20 Mar 21:19 _95.si
> > 18 Mar 06:49 segments_93
> > 20 Mar 21:19 segments_96
> >  6 Mar 21:22 write.lock
> >
> > ----------------
> >
> > When I load SegmentInfos for segments_96 directly, it succeeds, and I can
> > see it's referencing all the SegmentInfo except for _8w.
> > If I try to load SegmentInfos for segments_93, it gets past loading _8w and
> > fails on _8x.
> > Checking with a hex editor, segments_93 is referencing _8w ... _94 and
> > segments_96 is referencing _8x ... _95
> >
> > The IndexWriter failure is due to the IndexFileDeleter attempting to load
> > segments_93 to track referenced commit infos.
> >
> > Is this a state an IndexWriter could get the directory into, or does it
> > involve higher level interference (like copying files around)?
> >
> > Tim
> >
> > On Thu, 14 Apr 2022 at 13:20, Baris Kazar <baris.ka...@oracle.com> wrote:
> >
> > > yes that is a great point to look at first and that would eliminate any
> > > jdbc related issues that may lead to such problems.
> > > Best regards
> > > ________________________________
> > > From: Tim Whittington <t...@apache.org>
> > > Sent: Wednesday, April 13, 2022 9:17:44 PM
> > > To: java-user@lucene.apache.org <java-user@lucene.apache.org>
> > > Subject: Re: How to handle corrupt Lucene index
> > >
> > > Thanks for this - I'll have a look at the database server code that is
> > > managing the Lucene indexes and see if I can track it down.
> > >
> > > Tim
> > >
> > > On Thu, 14 Apr 2022 at 12:41, Robert Muir <rcm...@gmail.com> wrote:
> > >
> > > > On Wed, Apr 13, 2022 at 8:24 PM Tim Whittington
> > > > <t...@whittington.nz.invalid> wrote:
> > > > >
> > > > > I'm working with/on a database system that uses Lucene for full text
> > > > > indexes (currently using 7.3.0).
> > > > > We're encountering occasional problems that occur after unclean shutdowns
> > > > > of the database, resulting in
> > > > > "org.apache.lucene.index.CorruptIndexException: file mismatch" errors when
> > > > > the IndexWriter is constructed.
> > > > >
> > > > > In all of the cases this has occurred, CheckIndex finds no issues with the
> > > > > Lucene index.
> > > > >
> > > > > The database has write-ahead-log and recovery facilities, so making the
> > > > > Lucene indexes durable wrt database operations is doable, but in this case
> > > > > the IndexWriter itself is failing to initialise, so it looks like there
> > > > > needs to be a lower-level validation/recovery operation before reconciling
> > > > > transactions can take place.
> > > > >
> > > > > Can anyone provide any advice about how the database can detect and recover
> > > > > from this situation?
> > > > >
> > > >
> > > > File mismatch means files are getting mixed up. It is the equivalent
> > > > of swapping say, /etc/hosts and /etc/passwd on your computer.
> > > >
> > > > In your case you have a .si file (lets say it is named _79.si) that
> > > > really belongs to another segment (e.g. _42).
> > > >
> > > > This isn't a lucene issue, this is something else you must be using
> > > > that is "transporting files around", and it is mixing the files up.
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
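
In the directory listing quoted above, _8x is the only segment that has both compound files (.cfs/.cfe) and loose per-segment files (.fdt, .fdx, ...), and the two groups carry different timestamps. That may be a useful signal when looking for directories where files have been copied over each other. Below is a rough sketch of a check for that pattern using only file names. It is not a Lucene API and not a substitute for the segment ID checks Lucene performs itself: the class and method names are made up for illustration, and the rule "compound plus loose files is suspicious" is an assumption (it skips .si and .liv files, but doc-values update files on a compound segment would also trip it).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Lucene): flags segments that have both compound and loose files. */
public class SegmentFileAudit {

    /** Returns the segment prefix of a file name ("_8x_Lucene50_0.doc" gives "_8x"), or null for other files. */
    public static String segmentOf(String fileName) {
        if (!fileName.startsWith("_")) {
            return null; // segments_N, write.lock and similar are not per-segment files
        }
        int end = 1;
        while (end < fileName.length() && Character.isLetterOrDigit(fileName.charAt(end))) {
            end++;
        }
        return fileName.substring(0, end);
    }

    /** Groups file names by segment and returns segments holding compound files plus loose data files. */
    public static List<String> suspiciousSegments(List<String> fileNames) {
        Map<String, List<String>> bySegment = new TreeMap<>();
        for (String name : fileNames) {
            String segment = segmentOf(name);
            if (segment != null) {
                bySegment.computeIfAbsent(segment, k -> new ArrayList<>()).add(name);
            }
        }
        List<String> suspicious = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : bySegment.entrySet()) {
            boolean hasCompound = false;
            boolean hasLoose = false;
            for (String name : entry.getValue()) {
                if (name.endsWith(".cfs") || name.endsWith(".cfe")) {
                    hasCompound = true;
                } else if (!name.endsWith(".si") && !name.endsWith(".liv")) {
                    // Assumption: anything other than .si/.liv next to a .cfs is unexpected.
                    hasLoose = true;
                }
            }
            if (hasCompound && hasLoose) {
                suspicious.add(entry.getKey());
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        // A subset of the file names from the listing in this thread.
        List<String> listing = Arrays.asList(
                "_8w.fdt", "_8w.si", "_8w_Lucene50_0.doc",
                "_8x.cfe", "_8x.cfs", "_8x.fdt", "_8x.si", "_8x_Lucene50_0.doc",
                "_8y.cfe", "_8y.cfs", "_8y.si",
                "segments_93", "segments_96", "write.lock");
        System.out.println(suspiciousSegments(listing)); // prints [_8x]
    }
}
```

A hit from a check like this only says where to look first. Whether a given file really belongs to its segment is decided by the segment IDs recorded in segments_N, as described earlier in the thread.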