The primary premise behind the API is that IF all segments of an index
already have a created version stamp of Version.LATEST (and probably
also SegmentInfo.minVersion=Version.LATEST?), the index is in all
respects LATEST. In that case, "indexCreatedVersionMajor" should
ideally not block a Lucene upgrade.
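
As a minimal sketch of this premise (a simplified model, not Lucene
code): SegmentStamp and LatestCheck are hypothetical names, and versions
are reduced to their major number, which is an assumption made only to
keep the example self-contained.

```java
import java.util.List;

/** Hypothetical stand-in for per-segment version stamps; not Lucene's SegmentInfo. */
final class SegmentStamp {
  final int versionMajor;    // major version that wrote the segment
  final int minVersionMajor; // oldest major version that contributed to it via merges

  SegmentStamp(int versionMajor, int minVersionMajor) {
    this.versionMajor = versionMajor;
    this.minVersionMajor = minVersionMajor;
  }
}

final class LatestCheck {
  private LatestCheck() {}

  /** Returns true only if every segment's version AND minVersion equal latestMajor. */
  static boolean allSegmentsAtLatest(List<SegmentStamp> segments, int latestMajor) {
    for (SegmentStamp s : segments) {
      if (s.versionMajor != latestMajor || s.minVersionMajor != latestMajor) {
        return false;
      }
    }
    return true; // also true for an index that has no segments at all
  }
}
```

Under this model, a segment produced by merging 8.x and 9.x segments
(version 9, minVersion 8) makes the check fail on 9.x, while a segment
written purely by 9.x passes.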

Rethinking the implementation, it would be ideal if we could achieve
this without losing the information that
SegmentInfos.indexCreatedVersionMajor conveys today. Here is a
potential alternative approach to achieving the same objective,
minus the concerns around updating the "indexCreatedVersionMajor"
property in SegmentInfos:

- Introduce another property in SegmentInfos, say "maxSupportedVersionMajor".
- Upon index creation (and for existing indexes?),
maxSupportedVersionMajor = indexCreatedVersionMajor + 1
- Expose an API "commitAndUpdateMaxSupportedVersionMajor()" as
requested in this discussion. It checks whether all segments belong to
the LATEST version and, if so, sets
maxSupportedVersionMajor = Version.LATEST.major + 1. The implementation
of this API can follow similar lines to the PR
https://github.com/apache/lucene/pull/14607 (I will update this),
keeping maximum control with IndexWriter so that this can be done in a
safe way.
- All checks in SegmentInfos and elsewhere that fail opening an index
based on indexCreatedVersionMajor should instead be based on
maxSupportedVersionMajor.
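
The steps above can be sketched with a small, self-contained model. This
is not Lucene code: Segment, IndexMeta, updateMaxSupported and
canBeOpenedBy are hypothetical names, versions are reduced to their
major number, and the actual commit is left out.

```java
import java.util.List;

/** Hypothetical per-segment stamps (major versions only); not Lucene's SegmentInfo. */
final class Segment {
  final int versionMajor;
  final int minVersionMajor;

  Segment(int versionMajor, int minVersionMajor) {
    this.versionMajor = versionMajor;
    this.minVersionMajor = minVersionMajor;
  }
}

/** Hypothetical model of the commit-level metadata; not Lucene's SegmentInfos. */
final class IndexMeta {
  final int indexCreatedVersionMajor; // preserved, never rewritten
  int maxSupportedVersionMajor;       // the proposed new property

  IndexMeta(int indexCreatedVersionMajor) {
    this.indexCreatedVersionMajor = indexCreatedVersionMajor;
    // Default at creation: readable by the creating major version and the next one.
    this.maxSupportedVersionMajor = indexCreatedVersionMajor + 1;
  }

  /**
   * Models the check inside the proposed commitAndUpdateMaxSupportedVersionMajor():
   * raise the bound only if every segment is entirely at latestMajor.
   */
  boolean updateMaxSupported(List<Segment> segments, int latestMajor) {
    for (Segment s : segments) {
      if (s.versionMajor != latestMajor || s.minVersionMajor != latestMajor) {
        return false; // an older segment remains, leave the bound untouched
      }
    }
    maxSupportedVersionMajor = Math.max(maxSupportedVersionMajor, latestMajor + 1);
    return true;
  }

  /** Stands in for the "index too old" checks that today use indexCreatedVersionMajor. */
  boolean canBeOpenedBy(int readerMajor) {
    return readerMajor <= maxSupportedVersionMajor;
  }
}
```

In this model, an index created on 8.x starts with
maxSupportedVersionMajor = 9 and cannot be opened by 10.x. Once all of
its segments are at 9 and the update is run on 9.x, the bound becomes 10
while indexCreatedVersionMajor still reads 8.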

Benefits of this approach:
- maxSupportedVersionMajor gives Lucene the flexibility to change the
policy for version compatibility without changing the validity check
logic
- We can also track how many "upgrades" an index has gone through
based on the difference between maxSupportedVersionMajor and
indexCreatedVersionMajor

Would appreciate any thoughts. Thanks.

- Rahul


On Sun, May 4, 2025 at 1:11 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>
> Robert,
> Thanks for chiming in and sharing your concerns. Even though it may seem
> risky at first, I do believe this can be done in a safe way where Lucene
> retains full control to enforce the required checks and balances on the
> integrity of the index and avoid bogus bug reports.
>
> For further consideration, I have created a pull request to concretize the 
> idea here:
> https://github.com/apache/lucene/pull/14607
>
> Would appreciate the thoughts of the committers on this. I am happy to work 
> on filling any gaps to take this to closure or learn more if I am missing 
> anything here.
>
> Regards,
> Rahul
>
> On Fri, Apr 25, 2025 at 1:36 AM Robert Muir <rcm...@gmail.com> wrote:
>>
>> I'm strongly opposed to allowing code to change/reset this value.
>> Lucene needs to be able to defend itself from bogus bug reports for
>> this and ensure back compat really works.
>>
>> I instead support Mark Miller's proposal to only increase the
>> minimum-version when necessary on the lucene side:
>> https://github.com/apache/lucene/issues/13797
>>
>> On Fri, Apr 25, 2025 at 1:24 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>> >
>> > Adrien,
>> > Appreciate your thoughts on this. I agree that the option of reindexing 
>> > into a separate Directory is less "surgical" and definitely something to 
>> > consider when the scale is manageable. Solr already has an API which can 
>> > do this for SolrCloud. However, in my view, for cases where all the source
>> > fields are either stored or have docValues enabled, the idea of supporting
>> > in-place upgrade merits consideration for the following reasons:
>> >
>> > 1) A lot of users may not have the budget or bandwidth to allocate 
>> > additional resources to reindex into a parallel Directory. Especially in 
>> > cases where there are thousands of indexes across hundreds of nodes, this 
>> > process can quickly become cumbersome. This comes from personal
>> > experience, which was a major driver for me to build this solution for my
>> > employer.
>> >
>> > 2) Search engines like Solr and Elasticsearch are often one piece of a
>> > bigger commercial software offering. For deployments that are in
>> > customer environments and not completely under the vendor's control, the
>> > proposition of having to completely reindex the data onto parallel
>> > hardware can become a hard sell.
>> >
>> > 3) For install bases using Solr, Elasticsearch or OpenSearch, removing the
>> > overhead of index upgrade really helps them look at the search engine
>> > as one homogeneous piece of software that needs upgrading (when the time
>> > comes) rather than having to account for the index separately.
>> >
>> > I understand that in cases where the data type for an existing field
>> > changes (e.g. Trie vs Point field, as you mentioned), reindexing into a
>> > fresh Directory is the only option. However, in my experience, I have
>> > rarely seen this happen since 8.x, and for users without such changes I
>> > do see a way they can achieve significant savings in cost and effort by
>> > means of in-place upgrading.
>> >
>> > Reindexing in-place can provide a solid alternative to the now retired
>> > Lucene IndexUpgrader Tool as a means to effectively achieve a lossless
>> > upgrade. It is the search engine's responsibility to build the guardrails
>> > around the upgrade process (e.g. ensuring that all source fields are either
>> > stored or docValues true, or (maybe) disabling updates via external API
>> > calls while such a reindexing is in progress, etc.). But I do believe the
>> > effort is worthy of consideration.
>> >
>> > Thanks,
>> > Rahul
>> >
>> >
>> >
>> > On Thu, Apr 24, 2025 at 3:06 AM Adrien Grand <jpou...@gmail.com> wrote:
>> >>
>> >> Hi Rahul,
>> >>
>> >> I'm still concerned about going with in-place upgrading instead of 
>> >> reindexing in a separate directory. I don't mind opening an issue for 
>> >> discussion, but I'd like the option of reindexing into a separate 
>> >> directory to be considered. I think that it has lots of merits: it
>> >> avoids multiple versions of the same analyzer being used in the same
>> >> index, allows schemas to be upgraded (e.g. legacy GeoPoint ->
>> >> LatLonPoint), etc. The downsides that come to mind are that it puts a bit
>> >> more work on the application (rather than Lucene) and requires the 
>> >> upgrade to be done in a somewhat timely manner to be practical (or 
>> >> replaying the delta since the time when reindexing started may be heavy), 
>> >> but the trade-off still looks in favor of reindexing to a separate 
>> >> Directory to me.
>> >>
>> >> On Thu, Apr 24, 2025 at 5:34 AM Rahul Goswami <rahul196...@gmail.com> 
>> >> wrote:
>> >>>
>> >>> Hello,
>> >>> I wanted to circle back on this discussion before I go ahead with the
>> >>> creation of a GitHub issue. If there are any additional/unanswered
>> >>> concerns about the request, I am happy to elaborate further.
>> >>> Otherwise, I would like to go ahead and submit an issue and a PR for 
>> >>> further consideration.
>> >>>
>> >>> Thanks,
>> >>> Rahul
>> >>>
>> >>> On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com> 
>> >>> wrote:
>> >>>>
>> >>>> Adrien,
>> >>>> Thanks for your thoughts on this. To clarify, the request is not for a 
>> >>>> Lucene API which will upgrade the index, as I understand it might not 
>> >>>> always be possible to do a lossless upgrade of a cold index.
>> >>>>
>> >>>> The request is for an API which when called, will check the version of 
>> >>>> each segment in the SegmentInfos, and if each of them *already* has the 
>> >>>> latest version, change the index created version in SegmentInfos to the 
>> >>>> latest. Some API on IndexWriter, say commitAndUpdateCreatedVersion().
>> >>>>
>> >>>> None of the checks and balances or version compatibility restrictions 
>> >>>> from Lucene's side have to change. We (Solr) only aim to support this 
>> >>>> kind of reindexing between version X-1 and X. So as long as the schema 
>> >>>> adheres to certain conditions, you'd be able to go from X-1 to X to X+1 
>> >>>> without needing to reindex from source, or provision 
>> >>>> infrastructure/effort for reindexing to a parallel copy, and still have 
>> >>>> a completely lossless index.
>> >>>>
>> >>>> If it helps for further consideration, I am also happy to demonstrate 
>> >>>> the implementation I have in mind for the API via a PR.
>> >>>>
>> >>>> - Rahul
>> >>>>
>> >>>> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
>> >>>>>
>> >>>>> I worry that allowing us to reset the index created version on an 
>> >>>>> existing index would only solve one part of the problem, while putting 
>> >>>>> us on a path that makes it harder or impossible to solve the other 
>> >>>>> part of the problem.
>> >>>>>
>> >>>>> If Lucene had had such an API from the beginning, we would still need 
>> >>>>> to deal with legacy stuff, such as the old way that norms were encoded 
>> >>>>> in the index
>> >>>>> (https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
>> >>>>> or trie fields
>> >>>>> (https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
>> >>>>> There currently is no way of upgrading norms or schemas in-place, and
>> >>>>> supporting this sounds quite scary while I can't think of major 
>> >>>>> downsides of upgrading to a separate Directory and then atomically 
>> >>>>> swapping pointers, especially as Solr/Elasticsearch already have logic 
>> >>>>> for replicating operations to another copy of the data.
>> >>>>>
>> >>>>> This tells me that the upgraded index should not share state with the 
>> >>>>> previous index. And the upgrade process should take care of reindexing 
>> >>>>> to a new Directory while:
>> >>>>>  - bumping the index creation version,
>> >>>>>  - modernizing the schema if necessary (e.g. mapping "int" fields from 
>> >>>>> trie fields in the previous index to point fields in the new index, or 
>> >>>>> even sparse indexed fields in Lucene 10+),
>> >>>>>  - upgrading norms and analysis chains if necessary.
>> >>>>>
>> >>>>>
>> >>>>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com> 
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> Got it, thanks for providing additional context on the use case!
>> >>>>>>
>> >>>>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com> 
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> There is an effort underway in Apache Solr where we want to provide 
>> >>>>>>> a path to a legitimate upgrade without needing to reindex from 
>> >>>>>>> source:
>> >>>>>>> https://issues.apache.org/jira/browse/SOLR-17725
>> >>>>>>>
>> >>>>>>> Essentially the proposal is to read documents from segments where
>> >>>>>>> minVersion < current version and reindex them. While the process
>> >>>>>>> is underway, a custom merge policy would exclude such segments
>> >>>>>>> from merging with latest-version segments to prevent
>> >>>>>>> pollution.
>> >>>>>>>
>> >>>>>>> The result is an index which only contains segments with minVersion and
>> >>>>>>> version stamps the same as the current Lucene version (essentially 
>> >>>>>>> case #2 that we discussed). This index would in all respects be an 
>> >>>>>>> "upgraded" index, but would need "indexCreatedVersionMajor" to be 
>> >>>>>>> reset as well. This is where the Lucene API (to reset 
>> >>>>>>> "indexCreatedVersionMajor") becomes essential.
>> >>>>>>>
>> >>>>>>> I believe this is a pattern which can also be adopted by other
>> >>>>>>> Lucene-based search engines like OpenSearch and Elasticsearch, and
>> >>>>>>> hence having this API could potentially benefit a large Lucene user base.
>> >>>>>>>
>> >>>>>>> -Rahul
>> >>>>>>>
>> >>>>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com> 
>> >>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>> > Consider the following sequence of events...
>> >>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>> >>>>>>>> Lucene 8.x ==> Upgrade to 9.x ==> index a few documents and commit
>> >>>>>>>> ==> seg3 gets created with version 9.x, but merge doesn't kick in
>> >>>>>>>> ==> documents in seg1 and seg2 get deleted, followed by a commit ==>
>> >>>>>>>> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x
>> >>>>>>>> ==> Upgrade to Lucene 10.x fails.
>> >>>>>>>>
>> >>>>>>>> Thanks for the explanation. I am wondering if this is something 
>> >>>>>>>> that you commonly encounter, seems like a bit of an edge case?
>> >>>>>>>>
>> >>>>>>>> Regarding scenario 1, deleting the entire index and recreating it
>> >>>>>>>> is generally faster and less resource intensive than deleting
>> >>>>>>>> all the documents. Most systems built on top of Lucene, like Solr,
>> >>>>>>>> OpenSearch and Elasticsearch, expose a delete API for a
>> >>>>>>>> collection/index, and users just delete and recreate the index.
>> >>>>>>>> That is probably one of the reasons it hasn't come up much before.
>> >>>>>>>> Will let other community members chime in on this.
>> >>>>>>>>
>> >>>>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami 
>> >>>>>>>> <rahul196...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> For complete clarity..."minVersion" for a SegmentInfo is the min 
>> >>>>>>>>> of the minVersions of all segments involved in the merge which 
>> >>>>>>>>> resulted in this segment. If it is a "pure" segment, then 
>> >>>>>>>>> minVersion=version.
>> >>>>>>>>>
>> >>>>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami 
>> >>>>>>>>> <rahul196...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Ankit,
>> >>>>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all 
>> >>>>>>>>>> segments during the merge process?"
>> >>>>>>>>>> > That is correct
>> >>>>>>>>>>
>> >>>>>>>>>> "I am wondering if there is any way to end up in the 2nd scenario,
>> >>>>>>>>>> without having deleted all the documents first?"
>> >>>>>>>>>> > Consider the following sequence of events...
>> >>>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>> >>>>>>>>>> Lucene 8.x ==> Upgrade to 9.x ==> index a few documents and
>> >>>>>>>>>> commit ==> seg3 gets created with version 9.x, but merge doesn't
>> >>>>>>>>>> kick in ==> documents in seg1 and seg2 get deleted, followed by a
>> >>>>>>>>>> commit ==> You are left with seg3 in 9.x but
>> >>>>>>>>>> indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>> >>>>>>>>>>
>> >>>>>>>>>> -Rahul
>> >>>>>>>>>>
>> >>>>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain 
>> >>>>>>>>>> <jain.ank...@gmail.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi Rahul,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for starting this interesting discussion. I was initially 
>> >>>>>>>>>>> thinking that this API potentially allows upgrading 
>> >>>>>>>>>>> "indexCreatedVersionMajor" via the merge process after rewriting 
>> >>>>>>>>>>> all the segments, but I guess the SegmentInfo "minVersion" is 
>> >>>>>>>>>>> the min across all segments during the merge process?
>> >>>>>>>>>>>
>> >>>>>>>>>>> So, I am wondering if there is any way to end up in the 2nd 
>> >>>>>>>>>>> scenario, without having deleted all the documents first?
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks
>> >>>>>>>>>>> Ankit
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami 
>> >>>>>>>>>>> <rahul196...@gmail.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Hello,
>> >>>>>>>>>>>> Today, even after all documents in an index are deleted via an
>> >>>>>>>>>>>> API call, reindexing still doesn't change the
>> >>>>>>>>>>>> "indexCreatedVersionMajor" property value in SegmentInfos.
>> >>>>>>>>>>>> Hence, even after complete reindexing, an upgrade path X --> X+1
>> >>>>>>>>>>>> --> X+2 is still not possible, as we end up with an
>> >>>>>>>>>>>> IndexFormatTooOldException.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this 
>> >>>>>>>>>>>> property (upon a new commit) to the current Lucene version if:
>> >>>>>>>>>>>> 1) No more live docs present
>> >>>>>>>>>>>> OR
>> >>>>>>>>>>>> 2) All SegmentInfo in the index have a "minVersion" AND
>> >>>>>>>>>>>> "version" stamp of the latest version, but SegmentInfos has an
>> >>>>>>>>>>>> older "indexCreatedVersionMajor".
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> This will help users a LOT since they can now interact with the 
>> >>>>>>>>>>>> index purely via API without needing manual deletion and also 
>> >>>>>>>>>>>> help open up a legitimate path to upgrade when an index doesn't 
>> >>>>>>>>>>>> HAVE to be repopulated from the source.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> If there is agreement, I am happy to pick this up and submit a 
>> >>>>>>>>>>>> PR.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>> Rahul Goswami
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Adrien
>> >>
>> >>
>> >>
>> >> --
>> >> Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
