The primary premise behind the API is that if all segments of an index already have a created-version stamp of Version.LATEST (and probably also SegmentInfo.minVersion = Version.LATEST?), the index is in all respects LATEST. "indexCreatedVersionMajor" should ideally not block a Lucene upgrade in that case.
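To make the premise concrete, here is a minimal sketch in plain Java. It is a simplified model and not Lucene code: the Segment and IndexMeta records and both helper methods are hypothetical stand-ins for SegmentInfo and SegmentInfos. The only behavior it assumes from Lucene is that an index is rejected when its indexCreatedVersionMajor is older than the previous major version.

```java
import java.util.List;

/** Simplified model of the premise; these types are hypothetical, not Lucene classes. */
final class CreatedVersionPremise {

  /** Stand-in for SegmentInfo: major of the writing version and of minVersion. */
  record Segment(int versionMajor, int minVersionMajor) {}

  /** Stand-in for SegmentInfos: created-version major plus the segments of a commit. */
  record IndexMeta(int indexCreatedVersionMajor, List<Segment> segments) {}

  /**
   * True if every segment's version and minVersion are both on latestMajor.
   * Vacuously true for a commit with no segments.
   */
  static boolean allSegmentsLatest(IndexMeta index, int latestMajor) {
    return index.segments().stream()
        .allMatch(s -> s.versionMajor() == latestMajor && s.minVersionMajor() == latestMajor);
  }

  /** Models today's open check: the created major must be at least latestMajor - 1. */
  static boolean canOpen(IndexMeta index, int latestMajor) {
    return index.indexCreatedVersionMajor() >= latestMajor - 1;
  }

  public static void main(String[] args) {
    // Index created on 8.x that now holds only segments written by 9.x.
    IndexMeta index = new IndexMeta(8, List.of(new Segment(9, 9)));
    System.out.println("all segments on 9: " + allSegmentsLatest(index, 9));
    System.out.println("opens on 9:  " + canOpen(index, 9));
    System.out.println("opens on 10: " + canOpen(index, 10));
  }
}
```

Under this model the index opens on 9 but not on 10, even though every segment is already on 9, which is the case where I think the created version should not be the blocker.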
Rethinking the implementation, if we can achieve this without losing the information that SegmentInfos.indexCreatedVersionMajor conveys today, that might be ideal. Here is a potential alternative approach to achieving the same objective, minus the concerns around updating the "indexCreatedVersionMajor" property in SegmentInfos:

- Introduce another property in SegmentInfos, say "maxSupportedVersionMajor".
- Upon index creation (and for existing indexes?), maxSupportedVersionMajor = indexCreatedVersionMajor + 1.
- Expose an API "commitAndUpdateMaxSupportedVersionMajor()" as requested in this discussion. This checks if all segments belong to the LATEST version. If so, it sets maxSupportedVersionMajor = Version.LATEST.major + 1. The implementation of this API can be along similar lines as the PR https://github.com/apache/lucene/pull/14607 (I will update this), to ensure maximum control with IndexWriter in order to be able to do this in a safe way.
- All checks in SegmentInfos and elsewhere which fail opening an index based on indexCreatedVersionMajor should instead be based on maxSupportedVersionMajor.

Benefits of this approach:

- maxSupportedVersionMajor gives Lucene the flexibility to change the policy for version compatibility without changing the validity check logic.
- We can also track how many "upgrades" an index has gone through based on the difference between maxSupportedVersionMajor and indexCreatedVersionMajor.

Would appreciate any thoughts.

Thanks.
- Rahul

On Sun, May 4, 2025 at 1:11 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>
> Robert,
> Thanks for chiming in and sharing your concerns. Even though it may seem
> risky at first, I do believe this can be done in a safe way where Lucene can
> have all the control to enforce the required checks-and-balances on the
> integrity of the index and avoid bogus bug reports.
>
> For further consideration, I have created a pull request to concretize the
> idea here:
> https://github.com/apache/lucene/pull/14607
>
> Would appreciate the thoughts of the committers on this. I am happy to work
> on filling any gaps to take this to closure or learn more if I am missing
> anything here.
>
> Regards,
> Rahul
>
> On Fri, Apr 25, 2025 at 1:36 AM Robert Muir <rcm...@gmail.com> wrote:
>>
>> I'm strongly opposed to allowing code to change/reset this value.
>> Lucene needs to be able to defend itself from bogus bug reports for
>> this and ensure back compat really works.
>>
>> I instead support Mark Miller's proposal to only increase the
>> minimum-version when necessary on the Lucene side:
>> https://github.com/apache/lucene/issues/13797
>>
>> On Fri, Apr 25, 2025 at 1:24 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>> >
>> > Adrien,
>> > Appreciate your thoughts on this. I agree that the option of reindexing
>> > into a separate Directory is less "surgical" and definitely something to
>> > consider when the scale is manageable. Solr already has an API which can
>> > do this for SolrCloud. However, in my view, for cases where all the source
>> > fields are either stored or docValues true, the idea to support in-place
>> > upgrade merits a consideration due to the following reasons:
>> >
>> > 1) A lot of users may not have the budget or bandwidth to allocate
>> > additional resources to reindex into a parallel Directory. Especially in
>> > cases where there are thousands of indexes across hundreds of nodes, this
>> > process can quickly become cumbersome. This comes from a personal
>> > experience which was a major driver for me building this solution for my
>> > employer.
>> >
>> > 2) Search engines like Solr and Elasticsearch are often a piece in a
>> > bigger commercial software offering.
>> > In case of deployments which are in
>> > customer environments and not completely in control of the vendor, this
>> > proposition of having to completely reindex the data onto parallel
>> > hardware can become a hard sell.
>> >
>> > 3) For install bases using Solr, Elasticsearch or OpenSearch, removing the
>> > overhead of index upgrade really helps them to look at the search engine
>> > as one homogeneous piece of software that needs upgrading (when the time
>> > comes) rather than having to account for the index separately.
>> >
>> > I understand that in cases where the data type for an existing field
>> > changes (e.g. Trie vs Point field as you mentioned), reindexing into a
>> > fresh Directory is the only option. However in my experience, since 8.x I
>> > have rarely seen this happen, and for such users I do see a way they can
>> > achieve significant savings in terms of cost and effort by means of
>> > in-place upgrading.
>> >
>> > Reindexing in-place can provide a solid alternative to the now retired
>> > Lucene IndexUpgrader Tool as a means to effectively achieve a lossless
>> > upgrade. It is the search engine's responsibility to build the guardrails
>> > around the upgrade process (e.g. ensuring that all source fields are either
>> > stored or docValues true, or (maybe) disabling updates via external API
>> > calls while such a reindexing is in progress, etc.). But I do believe the
>> > effort is worthy of consideration.
>> >
>> > Thanks,
>> > Rahul
>> >
>> > On Thu, Apr 24, 2025 at 3:06 AM Adrien Grand <jpou...@gmail.com> wrote:
>> >>
>> >> Hi Rahul,
>> >>
>> >> I'm still concerned about going with in-place upgrading instead of
>> >> reindexing in a separate directory. I don't mind opening an issue for
>> >> discussion, but I'd like the option of reindexing into a separate
>> >> directory to be considered.
>> >> I think that it has lots of merits by
>> >> avoiding multiple versions of the same analyzer to be used in the same
>> >> index, allowing schemas to be upgraded (e.g. legacy GeoPoint ->
>> >> LatLonPoint), etc. The downsides that come to mind are that it puts a bit
>> >> more work on the application (rather than Lucene) and requires the
>> >> upgrade to be done in a somewhat timely manner to be practical (or
>> >> replaying the delta since the time when reindexing started may be heavy),
>> >> but the trade-off still looks in favor of reindexing to a separate
>> >> Directory to me.
>> >>
>> >> On Thu, Apr 24, 2025 at 5:34 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>> >>>
>> >>> Hello,
>> >>> I wanted to circle back on this discussion before I go ahead with the
>> >>> creation of a GitHub issue. If there are any additional/unanswered
>> >>> concerns about the request, I am happy to elaborate further.
>> >>> Otherwise, I would like to go ahead and submit an issue and a PR for
>> >>> further consideration.
>> >>>
>> >>> Thanks,
>> >>> Rahul
>> >>>
>> >>> On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>> >>>>
>> >>>> Adrien,
>> >>>> Thanks for your thoughts on this. To clarify, the request is not for a
>> >>>> Lucene API which will upgrade the index, as I understand it might not
>> >>>> always be possible to do a lossless upgrade of a cold index.
>> >>>>
>> >>>> The request is for an API which, when called, will check the version of
>> >>>> each segment in the SegmentInfos, and if each of them *already* has the
>> >>>> latest version, change the index created version in SegmentInfos to the
>> >>>> latest. Some API on IndexWriter, say commitAndUpdateCreatedVersion().
>> >>>>
>> >>>> None of the checks and balances or version compatibility restrictions
>> >>>> from Lucene's side have to change. We (Solr) only aim to support this
>> >>>> kind of reindexing between version X-1 and X.
>> >>>> So as long as the schema
>> >>>> adheres to certain conditions, you'd be able to go from X-1 to X to X+1
>> >>>> without needing to reindex from source, or provision
>> >>>> infrastructure/effort for reindexing to a parallel copy, and still have
>> >>>> a completely lossless index.
>> >>>>
>> >>>> If it helps for further consideration, I am also happy to demonstrate
>> >>>> the implementation I have in mind for the API via a PR.
>> >>>>
>> >>>> - Rahul
>> >>>>
>> >>>> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
>> >>>>>
>> >>>>> I worry that allowing us to reset the index created version on an
>> >>>>> existing index would only solve one part of the problem, while putting
>> >>>>> us on a path that makes it harder or impossible to solve the other
>> >>>>> part of the problem.
>> >>>>>
>> >>>>> If Lucene had had such an API from the beginning, we would still need
>> >>>>> to deal with legacy stuff, such as the old way that norms were encoded
>> >>>>> in the index
>> >>>>> (https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
>> >>>>> or trie fields
>> >>>>> (https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
>> >>>>> There currently is no way of upgrading norms or schemas in-place, and
>> >>>>> supporting this sounds quite scary while I can't think of major
>> >>>>> downsides of upgrading to a separate Directory and then atomically
>> >>>>> swapping pointers, especially as Solr/Elasticsearch already have logic
>> >>>>> for replicating operations to another copy of the data.
>> >>>>>
>> >>>>> This tells me that the upgraded index should not share state with the
>> >>>>> previous index.
>> >>>>> And the upgrade process should take care of reindexing
>> >>>>> to a new Directory while:
>> >>>>> - bumping the index creation version,
>> >>>>> - modernizing the schema if necessary (e.g. mapping "int" fields from
>> >>>>> trie fields in the previous index to point fields in the new index, or
>> >>>>> even sparse indexed fields in Lucene 10+),
>> >>>>> - upgrading norms and analysis chains if necessary.
>> >>>>>
>> >>>>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Got it, thanks for providing additional context on the use case!
>> >>>>>>
>> >>>>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> There is an effort underway in Apache Solr where we want to provide
>> >>>>>>> a path to a legitimate upgrade without needing to reindex from
>> >>>>>>> source:
>> >>>>>>> https://issues.apache.org/jira/browse/SOLR-17725
>> >>>>>>>
>> >>>>>>> Essentially the proposal is to read documents from segments where
>> >>>>>>> minVersion < current version and reindex them. At the same time,
>> >>>>>>> while the process is underway, have a custom merge policy which
>> >>>>>>> would exclude such segments from merging with latest version
>> >>>>>>> segments to prevent pollution.
>> >>>>>>>
>> >>>>>>> The result is an index which only contains segments with minVersion and
>> >>>>>>> version stamps the same as the current Lucene version (essentially
>> >>>>>>> case #2 that we discussed). This index would in all respects be an
>> >>>>>>> "upgraded" index, but would need "indexCreatedVersionMajor" to be
>> >>>>>>> reset as well. This is where the Lucene API (to reset
>> >>>>>>> "indexCreatedVersionMajor") becomes essential.
>> >>>>>>>
>> >>>>>>> I believe this is a pattern which can also be adopted by other
>> >>>>>>> Lucene-based search engines like OpenSearch and Elasticsearch, and
>> >>>>>>> hence having this API could potentially benefit a large Lucene base.
>> >>>>>>>
>> >>>>>>> -Rahul
>> >>>>>>>
>> >>>>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> > Consider the following sequence of events...
>> >>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>> >>>>>>>> Lucene 8.x. ==> Upgrade to 9.x ==> index a few documents and commit
>> >>>>>>>> ==> seg3 gets created with version 9.x, but merge doesn't kick in
>> >>>>>>>> ==> documents in seg1 and seg2 get deleted followed by commit ==>
>> >>>>>>>> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x
>> >>>>>>>> ==> Upgrade to Lucene 10.x fails.
>> >>>>>>>>
>> >>>>>>>> Thanks for the explanation. I am wondering if this is something
>> >>>>>>>> that you commonly encounter; it seems like a bit of an edge case?
>> >>>>>>>>
>> >>>>>>>> Regarding scenario 1, deleting the entire index and recreating it
>> >>>>>>>> is generally faster and less resource intensive than deleting
>> >>>>>>>> all the documents. Most systems built on top of Lucene like Solr,
>> >>>>>>>> OpenSearch, Elasticsearch expose a delete API for collection/index,
>> >>>>>>>> and users just delete and recreate the index. Probably one of the
>> >>>>>>>> reasons it hasn't come up much before. Will let other community
>> >>>>>>>> members chime in on this.
>> >>>>>>>>
>> >>>>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> For complete clarity..."minVersion" for a SegmentInfo is the min
>> >>>>>>>>> of the minVersions of all segments involved in the merge which
>> >>>>>>>>> resulted in this segment. If it is a "pure" segment, then
>> >>>>>>>>> minVersion=version.
>> >>>>>>>>>
>> >>>>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Ankit,
>> >>>>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all
>> >>>>>>>>>> segments during the merge process?"
>> >>>>>>>>>> > That is correct
>> >>>>>>>>>>
>> >>>>>>>>>> I am wondering if there is any way to end up in the 2nd scenario,
>> >>>>>>>>>> without having deleted all the documents first?
>> >>>>>>>>>> > Consider the following sequence of events...
>> >>>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>> >>>>>>>>>> Lucene 8.x. ==> Upgrade to 9.x ==> index a few documents and
>> >>>>>>>>>> commit ==> seg3 gets created with version 9.x, but merge doesn't
>> >>>>>>>>>> kick in ==> documents in seg1 and seg2 get deleted followed by
>> >>>>>>>>>> commit ==> You are left with seg3 in 9.x but
>> >>>>>>>>>> indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>> >>>>>>>>>>
>> >>>>>>>>>> -Rahul
>> >>>>>>>>>>
>> >>>>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi Rahul,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for starting this interesting discussion. I was initially
>> >>>>>>>>>>> thinking that this API potentially allows upgrading
>> >>>>>>>>>>> "indexCreatedVersionMajor" via the merge process after rewriting
>> >>>>>>>>>>> all the segments, but I guess the SegmentInfo "minVersion" is
>> >>>>>>>>>>> the min across all segments during the merge process?
>> >>>>>>>>>>>
>> >>>>>>>>>>> So, I am wondering if there is any way to end up in the 2nd
>> >>>>>>>>>>> scenario, without having deleted all the documents first?
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks
>> >>>>>>>>>>> Ankit
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Hello,
>> >>>>>>>>>>>> Today even after all documents in an index are deleted via an
>> >>>>>>>>>>>> API call, reindexing still doesn't change the
>> >>>>>>>>>>>> "indexCreatedVersionMajor" property value in SegmentInfos.
>> >>>>>>>>>>>> Hence even after complete reindexing, an upgrade path X --> X+1
>> >>>>>>>>>>>> --> X+2 is still not possible as we end up with an
>> >>>>>>>>>>>> IndexFormatTooOldException.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this
>> >>>>>>>>>>>> property (upon a new commit) to the current Lucene version if:
>> >>>>>>>>>>>> 1) No more live docs are present
>> >>>>>>>>>>>> OR
>> >>>>>>>>>>>> 2) All SegmentInfo in the index have a "minVersion" AND
>> >>>>>>>>>>>> "version" stamp of the latest version, but SegmentInfos has an
>> >>>>>>>>>>>> older "indexCreatedVersionMajor".
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> This will help users a LOT since they can now interact with the
>> >>>>>>>>>>>> index purely via API without needing manual deletion and also
>> >>>>>>>>>>>> help open up a legitimate path to upgrade when an index doesn't
>> >>>>>>>>>>>> HAVE to be repopulated from the source.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> If there is agreement, I am happy to pick this up and submit a
>> >>>>>>>>>>>> PR.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>> Rahul Goswami
>> >>>>>>>>>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Adrien
>> >>
>> >> --
>> >> Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org