I think "broken offsets" refers to token offsets "going backwards". Offsets are token attributes that point back to the token's start and end character positions in the original indexed text. Going backwards means that a token with a greater position (in the sequence of tokens, or token graph) has a lesser offset than a token before it; offsets should never decrease (or maybe they must be strictly increasing, I forget). Broken offsets should not occur, but they often do with custom analysis chains, and if you use term vectors they could be a problem.
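To make "going backwards" concrete, here is a minimal standalone sketch of the kind of check I mean. It does not use any Lucene classes: `Tok` and `firstBrokenOffset` are names made up for illustration, and it assumes the rule is "start offsets must not decrease, and the end offset must not be less than the start offset", which may not match exactly what Lucene enforces.

```java
import java.util.Arrays;
import java.util.List;

/** Standalone illustration; none of these names come from Lucene. */
class OffsetCheck {

    /** Hypothetical stand-in for a token plus its start/end offsets. */
    static final class Tok {
        final String term;
        final int startOffset;
        final int endOffset;

        Tok(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Returns the index of the first token whose offsets are broken,
     * or -1 if all offsets are consistent. Assumed rule: start offsets
     * never decrease from one token to the next, and each token has
     * 0 <= startOffset <= endOffset.
     */
    static int firstBrokenOffset(List<Tok> tokens) {
        int lastStart = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.startOffset < 0 || t.endOffset < t.startOffset) {
                return i; // malformed offsets on this token alone
            }
            if (t.startOffset < lastStart) {
                return i; // offset went backwards relative to an earlier token
            }
            lastStart = t.startOffset;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Offsets only move forward: fine.
        List<Tok> good = Arrays.asList(
            new Tok("quick", 0, 5),
            new Tok("fox", 6, 9));

        // A filter emits the parts first and the original token afterwards,
        // so the last token's start offset (0) is behind the previous one (3).
        List<Tok> bad = Arrays.asList(
            new Tok("wi", 0, 2),
            new Tok("fi", 3, 5),
            new Tok("wifi", 0, 5));

        System.out.println(firstBrokenOffset(good)); // -1
        System.out.println(firstBrokenOffset(bad));  // 2
    }
}
```

The second list is the sort of token stream a custom filter can produce when it emits tokens in an order that does not follow the text.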

On Wed, Jan 12, 2022 at 12:36 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>
> Thanks Vinay for the link to Erick's talk! I hadn't seen it and I must
> admit it did help put a few things into perspective.
>
> I was able to track down the JIRAs (thank you 'git blame')
> surrounding/leading up to this architectural decision and the linked
> patches:
> https://issues.apache.org/jira/browse/LUCENE-7703 (Record the version that
> was used at index creation time)
> https://issues.apache.org/jira/browse/LUCENE-7730 (Better encode length
> normalization in similarities)
> https://issues.apache.org/jira/browse/LUCENE-7837 (Use
> indexCreatedVersionMajor to fail opening too old indices)
>
> From these JIRAs what I was able to piece together is that if not
> reindexed, relevance scoring might act in unpredictable ways. For my use
> case, I can live with that since we provide an explicit sort on one or more
> fields.
>
> In LUCENE-7703, Adrien says "we will reject broken offsets in term vectors
> as of 7.0". So my questions to the community are
> i) What are these offsets, and what feature/s might break with respect to
> these offsets if not reindexed?
> ii) Do the length normalization changes in LUCENE-7730 affect only
> relevance scores?
>
> I understand I could be playing with fire here, but reindexing is not a
> practical solution for my situation. At least not in the near future until
> I figure out a more seamless way of reindexing with minimal downtime given
> that there are multiple 1TB+ indexes. Would appreciate inputs from the dev
> community on this.
>
> Thanks,
> Rahul
>
> On Sun, Jan 9, 2022 at 2:41 PM Vinay Rajput <vinayrajput4...@gmail.com>
> wrote:
>
> > Hi Rahul,
> >
> > I am not an expert so someone else might provide a better answer. However,
> > I remember @Erick briefly talked about this restriction in one of his
> > talks here:-
> > https://www.youtube.com/watch?v=eaQBH_H3d3g&t=621s (not sure if you have
> > seen it already).
> >
> > As he explains, earlier it looked like IndexUpgrader tool was doing the job
> > perfectly but it wasn't always the case. There is no guarantee that after
> > using the IndexUpgrader tool, your 8.x index will keep all of the
> > characteristics of lucene 8. There can be some situations (e.g. incorrect
> > offset) where you might get an incorrect relevance score which might be
> > difficult to trace and debug. So, Lucene developers now made it explicit
> > that what people were doing earlier was not ideal, and they should now plan
> > to reindex all the documents during the major upgrade.
> >
> > Having said that, what you have done can just work without any issue as
> > long as you don't encounter any odd sorting behavior. This may/may not be
> > super critical depending on the business use case and that is where you
> > might need to make a decision.
> >
> > Thanks,
> > Vinay
> >
> > On Sat, Jan 8, 2022 at 10:27 PM Rahul Goswami <rahul196...@gmail.com>
> > wrote:
> >
> > > Hello,
> > > Would appreciate any insights on the issue. Are there any backward
> > > incompatible changes in 8.x index because of which the lucene upgrader is
> > > unable to upgrade any index EVER touched by <= 6.x? Or is the restriction
> > > more of a safety net at this point for possible future incompatibilities?
> > >
> > > Thanks,
> > > Rahul
> > >
> > > On Thu, Jan 6, 2022 at 11:46 PM Rahul Goswami <rahul196...@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > > I am using Apache Solr 7.7.2 with indexes which were originally created
> > > > on 4.8 and upgraded ever since. I recently tried upgrading to 8.x using
> > > > the lucene IndexUpgrader tool and the upgrade fails. I know that lucene
> > > > 8.x prevents opening any segment which was touched by <= 6.x at any
> > > > point in the past. I also know the general recommendation is to reindex
> > > > upon migration to another major release, however it is not always
> > > > feasible.
> > > >
> > > > So I tried to remove the check for LATEST-1 in SegmentInfos.java (
> > > > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L321
> > > > ) and also checked for other references to IndexFormatTooOldException.
> > > > Turns out that removing this check and rebuilding lucene-core lets the
> > > > upgrade go through fine. I ran a full sequence of index upgrades from
> > > > 5.x -> 6.x -> 7.x -> 8.x, which went through fine. Also search/update
> > > > operations work without any issues in 8.x.
> > > >
> > > > I could not find any JIRAs which talk about the technical reason behind
> > > > imposing this restriction, and would like to know the nitty-gritties.
> > > > Also would like to know about any potential pitfalls that I might be
> > > > overlooking with the above hack.
> > > >
> > > > Thanks,
> > > > Rahul