Thanks for the explanation, Michael. I read more about term vectors, and that in combination with your explanation helps put things into perspective.
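For readers following the thread, the "offsets going backwards" condition described in the quoted message below can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lucene's actual code: `OffsetCheck`, `Token`, and `firstBrokenOffset` are hypothetical names, and the sketch assumes start offsets only need to be non-decreasing (the quoted message is unsure whether they must be strictly increasing). Lucene's `OffsetAttribute` documents offsets as character positions in the source text.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetCheck {

    /** Hypothetical stand-in for a token's offset attributes; not a Lucene class. */
    public static final class Token {
        final String term;
        final int startOffset; // where the token starts in the original text
        final int endOffset;   // where the token ends in the original text

        public Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Returns the index of the first token whose offsets are "broken", or -1 if none.
     * Assumption: a token is broken if its start offset is negative, its end offset
     * is before its start offset, or its start offset is smaller than the start
     * offset of the token before it in the stream.
     */
    public static int firstBrokenOffset(List<Token> tokens) {
        int lastStart = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.startOffset < 0 || t.endOffset < t.startOffset || t.startOffset < lastStart) {
                return i;
            }
            lastStart = t.startOffset;
        }
        return -1;
    }

    public static void main(String[] args) {
        // A synonym sharing the offsets of the previous token is fine in this sketch.
        List<Token> ok = Arrays.asList(
            new Token("quick", 0, 5), new Token("fast", 0, 5), new Token("fox", 6, 9));
        // A custom filter that emits a joined token after its parts goes backwards.
        List<Token> broken = Arrays.asList(
            new Token("wi", 0, 2), new Token("fi", 3, 5), new Token("wifi", 0, 5));

        System.out.println(firstBrokenOffset(ok));     // prints -1
        System.out.println(firstBrokenOffset(broken)); // prints 2
    }
}
```

The second list is the sort of output a custom analysis chain might produce. Whether a particular index actually contains such tokens depends on the analyzers used when it was built.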

On Thu, Jan 13, 2022 at 8:53 AM Michael Sokolov <msoko...@gmail.com> wrote:

> I think the "broken offsets" refers to offsets of tokens "going
> backwards". Offsets are attributes of tokens that refer back to their
> byte position in the original indexed text. Going backwards means -- a
> token with a greater position (in the sequence of tokens, or token
> graph) should not have a lesser (or maybe it must be strictly
> increasing I forget) offset. If you use term vectors, and have these
> broken offsets, which should not but do often occur with custom
> analysis chains, this could be a problem.
>
> On Wed, Jan 12, 2022 at 12:36 AM Rahul Goswami <rahul196...@gmail.com>
> wrote:
> >
> > Thanks Vinay for the link to Erick's talk! I hadn't seen it and I must
> > admit it did help put a few things into perspective.
> >
> > I was able to track down the JIRAs (thank you 'git blame')
> > surrounding/leading up to this architectural decision and the linked
> > patches:
> > https://issues.apache.org/jira/browse/LUCENE-7703 (Record the version that
> > was used at index creation time)
> > https://issues.apache.org/jira/browse/LUCENE-7730 (Better encode length
> > normalization in similarities)
> > https://issues.apache.org/jira/browse/LUCENE-7837 (Use
> > indexCreatedVersionMajor to fail opening too old indices)
> >
> > From these JIRAs what I was able to piece together is that if not
> > reindexed, relevance scoring might act in unpredictable ways. For my use
> > case, I can live with that since we provide an explicit sort on one or more
> > fields.
> >
> > In LUCENE-7703, Adrien says "we will reject broken offsets in term vectors
> > as of 7.0". So my questions to the community are
> > i) What are these offsets, and what feature/s might break with respect to
> > these offsets if not reindexed?
> > ii) Do the length normalization changes in LUCENE-7730 affect only
> > relevance scores?
> >
> > I understand I could be playing with fire here, but reindexing is not a
> > practical solution for my situation. At least not in the near future until
> > I figure out a more seamless way of reindexing with minimal downtime given
> > that there are multiple 1TB+ indexes. Would appreciate inputs from the dev
> > community on this.
> >
> > Thanks,
> > Rahul
> >
> > On Sun, Jan 9, 2022 at 2:41 PM Vinay Rajput <vinayrajput4...@gmail.com>
> > wrote:
> >
> > > Hi Rahul,
> > >
> > > I am not an expert so someone else might provide a better answer. However,
> > > I remember @Erick briefly talked about this restriction in one of his
> > > talks here:-
> > > https://www.youtube.com/watch?v=eaQBH_H3d3g&t=621s (not sure if you have
> > > seen it already).
> > >
> > > As he explains, earlier it looked like IndexUpgrader tool was doing the job
> > > perfectly but it wasn't always the case. There is no guarantee that after
> > > using the IndexUpgrader tool, your 8.x index will keep all of the
> > > characteristics of lucene 8. There can be some situations (e.g. incorrect
> > > offset) where you might get an incorrect relevance score which might be
> > > difficult to trace and debug. So, Lucene developers now made it explicit
> > > that what people were doing earlier was not ideal, and they should now plan
> > > to reindex all the documents during the major upgrade.
> > >
> > > Having said that, what you have done can just work without any issue as
> > > long as you don't encounter any odd sorting behavior. This may/may not be
> > > super critical depending on the business use case and that is where you
> > > might need to make a decision.
> > >
> > > Thanks,
> > > Vinay
> > >
> > > On Sat, Jan 8, 2022 at 10:27 PM Rahul Goswami <rahul196...@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > > Would appreciate any insights on the issue. Are there any backward
> > > > incompatible changes in 8.x index because of which the lucene upgrader is
> > > > unable to upgrade any index EVER touched by <= 6.x? Or is the restriction
> > > > more of a safety net at this point for possible future incompatibilities?
> > > >
> > > > Thanks,
> > > > Rahul
> > > >
> > > > On Thu, Jan 6, 2022 at 11:46 PM Rahul Goswami <rahul196...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > > I am using Apache Solr 7.7.2 with indexes which were originally created on
> > > > > 4.8 and upgraded ever since. I recently tried upgrading to 8.x using the
> > > > > lucene IndexUpgrader tool and the upgrade fails. I know that lucene 8.x
> > > > > prevents opening any segment which was touched by <= 6.x at any point in
> > > > > the past. I also know the general recommendation is to reindex upon
> > > > > migration to another major release, however it is not always feasible.
> > > > >
> > > > > So I tried to remove the check for LATEST-1 in SegmentInfos.java (
> > > > > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L321
> > > > > )
> > > > > and also checked for other references to IndexFormatTooOldException. Turns
> > > > > out that removing this check and rebuilding lucene-core lets the upgrade go
> > > > > through fine. I ran a full sequence of index upgrades from 5.x -> 6.x ->
> > > > > 7.x -> 8.x, which went through fine. Also search/update operations work
> > > > > without any issues in 8.x.
> > > > >
> > > > > I could not find any JIRAs which talk about the technical reason behind
> > > > > imposing this restriction, and would like to know the nitty-gritties. Also
> > > > > would like to know about any potential pitfalls that I might be overlooking
> > > > > with the above hack.
> > > > >
> > > > > Thanks,
> > > > > Rahul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org