Robert,
Thanks for chiming in and sharing your concerns. Even though it may seem
risky at first, I do believe this can be done in a safe way where Lucene
can have all the control to enforce the required checks-and-balances on the
integrity of the index and avoid bogus bug reports.

For further consideration, I have created a pull request to concretize the
idea here:
https://github.com/apache/lucene/pull/14607
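
To make the condition easier to discuss on-list, here is a rough, dependency-free sketch of the check being proposed. It is not the code in the PR: the names CreatedVersionCheck, Segment, canBumpCreatedVersion and createdVersionAfterCommit are hypothetical, and each segment is modeled only by the major part of its "version" and "minVersion" stamps.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CreatedVersionCheck {

    /** Hypothetical stand-in for a segment; only major versions are modeled. */
    public static final class Segment {
        final int versionMajor;
        final int minVersionMajor;

        public Segment(int versionMajor, int minVersionMajor) {
            this.versionMajor = versionMajor;
            this.minVersionMajor = minVersionMajor;
        }
    }

    /**
     * True only if every segment was written by, and contains data only from,
     * the latest major version. An empty index passes trivially.
     */
    public static boolean canBumpCreatedVersion(List<Segment> segments, int latestMajor) {
        for (Segment s : segments) {
            if (s.versionMajor != latestMajor || s.minVersionMajor != latestMajor) {
                return false;
            }
        }
        return true;
    }

    /** The created-version value to record on commit: bumped only when the check passes. */
    public static int createdVersionAfterCommit(
            List<Segment> segments, int createdMajor, int latestMajor) {
        return canBumpCreatedVersion(segments, latestMajor) ? latestMajor : createdMajor;
    }

    public static void main(String[] args) {
        // One segment merged from 8.x data, one pure 9.x segment: no bump.
        List<Segment> mixed = Arrays.asList(new Segment(9, 8), new Segment(9, 9));
        System.out.println(createdVersionAfterCommit(mixed, 8, 9));

        // Only pure 9.x segments remain: the created version may move to 9.
        List<Segment> pure = Collections.singletonList(new Segment(9, 9));
        System.out.println(createdVersionAfterCommit(pure, 8, 9));
    }
}
```

The point of the sketch is that the decision depends only on stamps Lucene already records, so Lucene stays in control of whether the value is allowed to change.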

Would appreciate the thoughts of the committers on this. I am happy to work
on filling any gaps to take this to closure or learn more if I am missing
anything here.

Regards,
Rahul

On Fri, Apr 25, 2025 at 1:36 AM Robert Muir <rcm...@gmail.com> wrote:

> I'm strongly opposed to allowing code to change/reset this value.
> Lucene needs to be able to defend itself from bogus bug reports for
> this and ensure back compat really works.
>
> I instead support Mark Miller's proposal to only increase the
> minimum-version when necessary on the lucene side:
> https://github.com/apache/lucene/issues/13797
>
> On Fri, Apr 25, 2025 at 1:24 AM Rahul Goswami <rahul196...@gmail.com>
> wrote:
> >
> > Adrien,
> > Appreciate your thoughts on this. I agree that the option of reindexing
> into a separate Directory is less "surgical" and definitely something to
> consider when the scale is manageable. Solr already has an API which can do
> this for SolrCloud. However, in my view, for cases where all the source
> fields are either stored or docValues true, the idea to support in-place
> upgrade merits a consideration due to the following reasons:
> >
> > 1) A lot of users may not have the budget or bandwidth to allocate
> additional resources to reindex into a parallel Directory. Especially in
> cases where there are thousands of indexes across hundreds of nodes, this
> process can quickly become cumbersome. This comes from personal
> experience, which was a major driver for me building this solution for my
> employer.
> >
> > 2) Search engines like Solr and Elasticsearch are often a piece in a
> bigger commercial software offering. In the case of deployments which are in
> customer environments and not completely under the vendor's control, this
> proposition of having to completely reindex the data onto parallel
> hardware can become a hard sell.
> >
> > 3) For install bases using Solr, Elasticsearch or OpenSearch, removing
> the overhead of index upgrade really helps them to look at the search
> engine as one homogeneous piece of software that needs upgrading (when the
> time comes) rather than having to account for the index separately.
> >
> > I understand that in cases where the data type for an existing field
> changes (e.g., Trie vs Point field as you mentioned), reindexing into a fresh
> Directory is the only option. However in my experience, since 8.x I have
> rarely seen this happen, and for such users I do see a way they can achieve
> significant savings in terms of cost and effort by means of in-place
> upgrading.
> >
> > Reindexing in-place can provide a solid alternative to the now retired
> Lucene IndexUpgrader Tool as a means to effectively achieve a lossless
> upgrade. It is the search engine's responsibility to build the guardrails
> around the upgrade process (e.g., ensuring that all source fields are either
> stored or docValues true, or (maybe) disabling updates via external API calls
> while such a reindexing is in progress, etc.). But I do believe the effort is
> worthy of consideration.
> >
> > Thanks,
> > Rahul
> >
> >
> >
> > On Thu, Apr 24, 2025 at 3:06 AM Adrien Grand <jpou...@gmail.com> wrote:
> >>
> >> Hi Rahul,
> >>
> >> I'm still concerned about going with in-place upgrading instead of
> reindexing in a separate directory. I don't mind opening an issue for
> discussion, but I'd like the option of reindexing into a separate directory
> to be considered. I think that it has lots of merits: avoiding multiple
> versions of the same analyzer being used in the same index, allowing
> schemas to be upgraded (e.g. legacy GeoPoint -> LatLonPoint), etc. The
> downsides that come to mind are that it puts a bit more work on the
> application (rather than Lucene) and requires the upgrade to be done in a
> somewhat timely manner to be practical (or replaying the delta since the
> time when reindexing started may be heavy), but the trade-off still looks
> in favor of reindexing to a separate Directory to me.
> >>
> >> On Thu, Apr 24, 2025 at 5:34 AM Rahul Goswami <rahul196...@gmail.com>
> wrote:
> >>>
> >>> Hello,
> >>> I wanted to circle back on this discussion before I go ahead with the
> creation of a GitHub issue. If there are any additional/unanswered concerns
> about the request, I am happy to elaborate further.
> >>> Otherwise, I would like to go ahead and submit an issue and a PR for
> further consideration.
> >>>
> >>> Thanks,
> >>> Rahul
> >>>
> >>> On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com>
> wrote:
> >>>>
> >>>> Adrien,
> >>>> Thanks for your thoughts on this. To clarify, the request is not for
> a Lucene API which will upgrade the index, as I understand it might not
> always be possible to do a lossless upgrade of a cold index.
> >>>>
> >>>> The request is for an API which, when called, will check the version
> of each segment in the SegmentInfos, and if each of them *already* has the
> latest version, change the index created version in SegmentInfos to the
> latest. Some API on IndexWriter, say commitAndUpdateCreatedVersion().
> >>>>
> >>>> None of the checks and balances or version compatibility restrictions
> from Lucene's side have to change. We (Solr) only aim to support this kind
> of reindexing between version X-1 and X. So as long as the schema adheres
> to certain conditions, you'd be able to go from X-1 to X to X+1 without
> needing to reindex from source, or provision infrastructure/effort for
> reindexing to a parallel copy, and still have a completely lossless index.
> >>>>
> >>>> If it helps for further consideration, I am also happy to demonstrate
> the implementation I have in mind for the API via a PR.
> >>>>
> >>>> - Rahul
> >>>>
> >>>> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com>
> wrote:
> >>>>>
> >>>>> I worry that allowing us to reset the index created version on an
> existing index would only solve one part of the problem, while putting us
> on a path that makes it harder or impossible to solve the other part of the
> problem.
> >>>>>
> >>>>> If Lucene had had such an API from the beginning, we would still
> need to deal with legacy stuff, such as the old way that norms were encoded
> in the index (
> https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
> or trie fields (
> https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
> There currently is no way of upgrading norms or schemas in-place, and
> supporting this sounds quite scary while I can't think of major downsides
> of upgrading to a separate Directory and then atomically swapping pointers,
> especially as Solr/Elasticsearch already have logic for replicating
> operations to another copy of the data.
> >>>>>
> >>>>> This tells me that the upgraded index should not share state with
> the previous index. And the upgrade process should take care of reindexing
> to a new Directory while:
> >>>>>  - bumping the index creation version,
> >>>>>  - modernizing the schema if necessary (e.g. mapping "int" fields
> from trie fields in the previous index to point fields in the new index, or
> even sparse indexed fields in Lucene 10+),
> >>>>>  - upgrading norms and analysis chains if necessary.
> >>>>>
> >>>>>
> >>>>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> Got it, thanks for providing additional context on the use case!
> >>>>>>
> >>>>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <
> rahul196...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> There is an effort underway in Apache Solr where we want to
> provide a path to a legitimate upgrade without needing to reindex from
> source:
> >>>>>>> https://issues.apache.org/jira/browse/SOLR-17725
> >>>>>>>
> >>>>>>> Essentially the proposal is to read documents from segments where
> minVersion < current version and reindex them. At the same time, while the
> process is underway, have a custom merge policy which would exclude such
> segments from merging with latest version segments to prevent pollution.
> >>>>>>>
> >>>>>>> Result is an index which only contains segments with minVersion
> and version stamps the same as the current Lucene version (essentially case
> #2 that we discussed). This index would in all respects be an "upgraded"
> index, but would need "indexCreatedVersionMajor" to be reset as well. This
> is where the Lucene API (to reset "indexCreatedVersionMajor") becomes
> essential.
> >>>>>>>
> >>>>>>> I believe this is a pattern which can also be adopted by other
> Lucene-based search engines like OpenSearch and Elasticsearch, and hence
> having this API could potentially benefit a large Lucene user base.
> >>>>>>>
> >>>>>>> -Rahul
> >>>>>>>
> >>>>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>> > Consider the following sequence of events...
> >>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
> Lucene 8.x ==> Upgrade to 9.x ==> index a few documents and commit ==> seg3
> gets created with version 9.x, but merge doesn't kick in ==> documents in
> seg1 and seg2 get deleted, followed by a commit ==> You are left with seg3 in
> 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
> >>>>>>>>
> >>>>>>>> Thanks for the explanation. I am wondering if this is something
> that you commonly encounter? It seems like a bit of an edge case.
> >>>>>>>>
> >>>>>>>> Regarding scenario 1, deleting the entire index and recreating it
> is generally faster and less resource-intensive than deleting all the
> documents. Most systems built on top of Lucene, like Solr, OpenSearch, and
> Elasticsearch, expose a delete API for collections/indexes, and users just delete
> and recreate the index. That is probably one of the reasons it hasn't come up much
> before. Will let other community members chime in on this.
> >>>>>>>>
> >>>>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <
> rahul196...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> For complete clarity..."minVersion" for a SegmentInfo is the min
> of the minVersions of all segments involved in the merge which resulted in
> this segment. If it is a "pure" segment, then minVersion=version.
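
In code form, the rule above amounts to the following. This is a minimal sketch that treats versions as plain major-version ints; MergedMinVersion and mergedMinVersion are hypothetical names, not Lucene classes or methods.

```java
import java.util.Arrays;
import java.util.List;

public class MergedMinVersion {

    /**
     * minVersion of a merged segment: the smallest minVersion among the
     * segments that took part in the merge.
     */
    public static int mergedMinVersion(List<Integer> inputMinVersions) {
        if (inputMinVersions.isEmpty()) {
            throw new IllegalArgumentException("a merge needs at least one input segment");
        }
        int min = Integer.MAX_VALUE;
        for (int v : inputMinVersions) {
            min = Math.min(min, v);
        }
        return min;
    }

    public static void main(String[] args) {
        // Merging one 8.x segment with two 9.x segments keeps minVersion at 8.
        System.out.println(mergedMinVersion(Arrays.asList(8, 9, 9))); // prints 8
        // A "pure" segment: minVersion equals its own version.
        System.out.println(mergedMinVersion(Arrays.asList(9)));       // prints 9
    }
}
```

This is why a plain merge cannot "launder" old data: a single old input keeps the merged segment's minVersion old.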
> >>>>>>>>>
> >>>>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <
> rahul196...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Ankit,
> >>>>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all
> segments during the merge process?"
> >>>>>>>>>> > That is correct
> >>>>>>>>>>
> >>>>>>>>>> I am wondering if there is any way to end up in the 2nd
> scenario, without having deleted all the documents first?
> >>>>>>>>>> > Consider the following sequence of events...
> >>>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
> Lucene 8.x ==> Upgrade to 9.x ==> index a few documents and commit ==> seg3
> gets created with version 9.x, but merge doesn't kick in ==> documents in
> seg1 and seg2 get deleted, followed by a commit ==> You are left with seg3 in
> 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
> >>>>>>>>>>
> >>>>>>>>>> -Rahul
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <
> jain.ank...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Rahul,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for starting this interesting discussion. I was
> initially thinking that this API potentially allows upgrading
> "indexCreatedVersionMajor" via the merge process after rewriting all the
> segments, but I guess the SegmentInfo "minVersion" is the min across all
> segments during the merge process?
> >>>>>>>>>>>
> >>>>>>>>>>> So, I am wondering if there is any way to end up in the 2nd
> scenario, without having deleted all the documents first?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> Ankit
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <
> rahul196...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hello,
> >>>>>>>>>>>> Today even after all documents in an index are deleted via an
> API call, reindexing still doesn't change the "indexCreatedVersionMajor"
> property value in SegmentInfos. Hence even after complete reindexing, an
> upgrade path X --> X+1 --> X+2 is still not possible as we end up with an
> IndexFormatTooOldException.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this
> property (upon a new commit) to the current Lucene version if:
> >>>>>>>>>>>> 1) No more live docs present
> >>>>>>>>>>>> OR
> >>>>>>>>>>>> 2) If all SegmentInfo in the index have a "minVersion" AND
> "version" stamp of the latest version, but SegmentInfos has an older
> "indexCreatedVersionMajor".
> >>>>>>>>>>>>
> >>>>>>>>>>>> This will help users a LOT since they can now interact with
> the index purely via API without needing manual deletion and also help open
> up a legitimate path to upgrade when an index doesn't HAVE to be
> repopulated from the source.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If there is agreement, I am happy to pick this up and submit
> a PR.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Rahul Goswami
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Adrien
> >>
> >>
> >>
> >> --
> >> Adrien
>
