Thanks for the explanation, Michael. I read more about term vectors, and that
in combination with your explanation helps put things into perspective.
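
For anyone following along, here is how I now picture the "going backwards"
condition. This is only a rough sketch in plain Java, not Lucene code: the
class and method names (OffsetCheck, firstBrokenOffset) are made up for
illustration, and the rule it applies (start offsets must not decrease from
one token to the next, and an end offset must not be smaller than its start
offset) is my assumption about what counts as "broken".

```java
/** Hypothetical helper for illustration only; not part of Lucene. */
final class OffsetCheck {
    private OffsetCheck() {}

    /**
     * Each row of offsets is {startOffset, endOffset} for one token, listed
     * in token (position) order.
     *
     * Returns the index of the first token with "broken" offsets, or -1 if
     * none is found. Assumed rule: startOffset >= 0, endOffset >= startOffset,
     * and startOffset >= the previous token's startOffset.
     */
    static int firstBrokenOffset(int[][] offsets) {
        int lastStart = 0;
        for (int i = 0; i < offsets.length; i++) {
            int start = offsets[i][0];
            int end = offsets[i][1];
            if (start < 0 || end < start || start < lastStart) {
                return i; // this token's offsets go backwards or are inverted
            }
            lastStart = start;
        }
        return -1;
    }
}
```

With this rule, two tokens sharing the same start offset (as synonyms at the
same position would) pass the check, while a later token that points back to
an earlier part of the text is flagged.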

On Thu, Jan 13, 2022 at 8:53 AM Michael Sokolov <msoko...@gmail.com> wrote:

> I think the "broken offsets" refers to offsets of tokens "going
> backwards". Offsets are attributes of tokens that refer back to their
> byte position in the original indexed text. Going backwards means -- a
> token with a greater position (in the sequence of tokens, or token
> graph) should not have a lesser (or maybe it must be strictly
> increasing I forget) offset. If you use term vectors, and have these
> broken offsets, which should not but do often occur with custom
> analysis chains, this could be a problem.
>
> On Wed, Jan 12, 2022 at 12:36 AM Rahul Goswami <rahul196...@gmail.com>
> wrote:
> >
> > Thanks Vinay for the link to Erick's talk! I hadn't seen it and I must
> > admit it did help put a few things into perspective.
> >
> > I was able to track down the JIRAs (thank you 'git blame')
> > surrounding/leading up to this architectural decision and the linked
> > patches:
> > https://issues.apache.org/jira/browse/LUCENE-7703  (Record the version
> > that was used at index creation time)
> > https://issues.apache.org/jira/browse/LUCENE-7730  (Better encode length
> > normalization in similarities)
> > https://issues.apache.org/jira/browse/LUCENE-7837  (Use
> > indexCreatedVersionMajor to fail opening too old indices)
> >
> > From these JIRAs what I was able to piece together is that if not
> > reindexed, relevance scoring might act in unpredictable ways. For my use
> > case, I can live with that since we provide an explicit sort on one or
> > more fields.
> >
> > In LUCENE-7703, Adrien says "we will reject broken offsets in term
> > vectors as of 7.0". So my questions to the community are:
> > i) What are these offsets, and what feature/s might break with respect to
> > these offsets if not reindexed?
> > ii) Do the length normalization changes in  LUCENE-7730 affect only
> > relevance scores?
> >
> > I understand I could be playing with fire here, but reindexing is not a
> > practical solution for my situation. At least not in the near future
> > until I figure out a more seamless way of reindexing with minimal
> > downtime given that there are multiple 1TB+ indexes. Would appreciate
> > inputs from the dev community on this.
> >
> > Thanks,
> > Rahul
> >
> > On Sun, Jan 9, 2022 at 2:41 PM Vinay Rajput <vinayrajput4...@gmail.com>
> > wrote:
> >
> > > Hi Rahul,
> > >
> > > I am not an expert so someone else might provide a better answer.
> > > However, I remember @Erick briefly talked about this restriction in one
> > > of his talks here:
> > > https://www.youtube.com/watch?v=eaQBH_H3d3g&t=621s (not sure if you have
> > > seen it already).
> > >
> > > As he explains, earlier it looked like the IndexUpgrader tool was doing
> > > the job perfectly, but that wasn't always the case. There is no
> > > guarantee that after using the IndexUpgrader tool, your 8.x index will
> > > keep all of the characteristics of Lucene 8. There can be some
> > > situations (e.g. incorrect offsets) where you might get an incorrect
> > > relevance score which might be difficult to trace and debug. So, Lucene
> > > developers have now made it explicit that what people were doing earlier
> > > was not ideal, and that they should plan to reindex all the documents
> > > during a major upgrade.
> > >
> > > Having said that, what you have done can just work without any issue as
> > > long as you don't encounter any odd sorting behavior. This may/may not
> > > be super critical depending on the business use case and that is where
> > > you might need to make a decision.
> > >
> > > Thanks,
> > > Vinay
> > >
> > > On Sat, Jan 8, 2022 at 10:27 PM Rahul Goswami <rahul196...@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > > Would appreciate any insights on the issue. Are there any backward
> > > > incompatible changes in the 8.x index because of which the Lucene
> > > > upgrader is unable to upgrade any index EVER touched by <= 6.x? Or is
> > > > the restriction more of a safety net at this point for possible future
> > > > incompatibilities?
> > > >
> > > > Thanks,
> > > > Rahul
> > > >
> > > > On Thu, Jan 6, 2022 at 11:46 PM Rahul Goswami <rahul196...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > > I am using Apache Solr 7.7.2 with indexes which were originally
> > > > > created on 4.8 and upgraded ever since. I recently tried upgrading to
> > > > > 8.x using the Lucene IndexUpgrader tool and the upgrade fails. I know
> > > > > that Lucene 8.x prevents opening any segment which was touched by
> > > > > <= 6.x at any point in the past. I also know the general
> > > > > recommendation is to reindex upon migration to another major release,
> > > > > however it is not always feasible.
> > > > >
> > > > > So I tried to remove the check for LATEST-1 in SegmentInfos.java
> > > > > (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L321)
> > > > > and also checked for other references to IndexFormatTooOldException.
> > > > > Turns out that removing this check and rebuilding lucene-core lets
> > > > > the upgrade go through fine. I ran a full sequence of index upgrades
> > > > > from 5.x -> 6.x -> 7.x -> 8.x, which went through fine. Also,
> > > > > search/update operations work without any issues in 8.x.
> > > > >
> > > > > I could not find any JIRAs which talk about the technical reason
> > > > > behind imposing this restriction, and would like to know the
> > > > > nitty-gritty. Also would like to know about any potential pitfalls
> > > > > that I might be overlooking with the above hack.
> > > > >
> > > > > Thanks,
> > > > > Rahul
> > > > >
> > > > >
> > > >
> > >
>
