Robert,
Thanks for chiming in and sharing your concerns. Even though it may seem risky at first, I do believe this can be done in a safe way where Lucene can have all the control to enforce the required checks-and-balances on the integrity of the index and avoid bogus bug reports.
For further consideration, I have created a pull request to concretize the idea here:
https://github.com/apache/lucene/pull/14607

Would appreciate the thoughts of the committers on this. I am happy to work on filling any gaps to take this to closure or learn more if I am missing anything here.

Regards,
Rahul

On Fri, Apr 25, 2025 at 1:36 AM Robert Muir <rcm...@gmail.com> wrote:

> I'm strongly opposed to allowing code to change/reset this value.
> Lucene needs to be able to defend itself from bogus bug reports for
> this and ensure back compat really works.
>
> I instead support Mark Miller's proposal to only increase the
> minimum-version when necessary on the lucene side:
> https://github.com/apache/lucene/issues/13797
>
> On Fri, Apr 25, 2025 at 1:24 AM Rahul Goswami <rahul196...@gmail.com> wrote:
> >
> > Adrien,
> > Appreciate your thoughts on this. I agree that the option of reindexing into a separate Directory is less "surgical" and definitely something to consider when the scale is manageable. Solr already has an API which can do this for SolrCloud. However, in my view, for cases where all the source fields are either stored or docValues true, the idea to support in-place upgrade merits a consideration due to the following reasons:
> >
> > 1) A lot of users may not have the budget or bandwidth to allocate additional resources to reindex into a parallel Directory. Especially in cases where there are thousands of indexes across hundreds of nodes, this process can quickly become cumbersome. This comes from a personal experience which was a major driver for me building this solution for my employer.
> >
> > 2) Search engines like Solr and Elasticsearch are often a piece in a bigger commercial software offering. In case of deployments which are in customer environments and not completely in control of the vendor, this proposition of having to completely reindex the data on to a parallel hardware can become a hard sell.
> >
> > 3) For install bases using Solr, Elasticsearch or OpenSearch, removing the overhead of index upgrade really helps them to look at the search engine as one homogenous piece of software that needs upgrading (when the time comes) rather than having to account for the index separately.
> >
> > I understand that in cases where the data type for an existing field changes (eg: Trie vs Point field as you mentioned), reindexing into a fresh Directory is the only option. However in my experience, since 8.x I have rarely seen this happen, and for such users I do see a way they can achieve significant savings in terms of cost and effort by means of in-place upgrading.
> >
> > Reindexing in-place can provide a solid alternative to the now retired Lucene IndexUpgrader Tool as a means to effectively achieve a lossless upgrade. It is the search engine's responsibility to build the guardrails around the upgrade process (eg: ensuring that all source fields are either stored or docValues true, or (maybe) disable updates via external API calls while such a reindexing is in progress etc). But I do believe the effort is worthy of consideration.
> >
> > Thanks,
> > Rahul
> >
> > On Thu, Apr 24, 2025 at 3:06 AM Adrien Grand <jpou...@gmail.com> wrote:
> >>
> >> Hi Rahul,
> >>
> >> I'm still concerned about going with in-place upgrading instead of reindexing in a separate directory. I don't mind opening an issue for discussion, but I'd like the option of reindexing into a separate directory to be considered. I think that it has lots of merits by avoiding multiple versions of the same analyzer to be used in the same index, allowing schemas to be upgraded (e.g. legacy GeoPoint -> LatLonPoint), etc.
> >> The downsides that come to mind are that it puts a bit more work on the application (rather than Lucene) and requires the upgrade to be done in a somewhat timely manner to be practical (or replaying the delta since the time when reindexing started may be heavy), but the trade-off still looks in favor of reindexing to a separate Directory to me.
> >>
> >> On Thu, Apr 24, 2025 at 5:34 AM Rahul Goswami <rahul196...@gmail.com> wrote:
> >>>
> >>> Hello,
> >>> I wanted to circle back on this discussion before I go ahead with the creation of a GitHub issue. If there are any additional/unanswered concerns about the request, I am happy to elaborate further.
> >>> Otherwise, I would like to go ahead and submit an issue and a PR for further consideration.
> >>>
> >>> Thanks,
> >>> Rahul
> >>>
> >>> On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com> wrote:
> >>>>
> >>>> Adrien,
> >>>> Thanks for your thoughts on this. To clarify, the request is not for a Lucene API which will upgrade the index, as I understand it might not always be possible to do a lossless upgrade of a cold index.
> >>>>
> >>>> The request is for an API which, when called, will check the version of each segment in the SegmentInfos, and if each of them *already* has the latest version, change the index created version in SegmentInfos to the latest. Some API on IndexWriter, say commitAndUpdateCreatedVersion().
> >>>>
> >>>> None of the checks and balances or version compatibility restrictions from Lucene's side have to change. We (Solr) only aim to support this kind of reindexing between version X-1 and X. So as long as the schema adheres to certain conditions, you'd be able to go from X-1 to X to X+1 without needing to reindex from source, or provision infrastructure/effort for reindexing to a parallel copy, and still have a completely lossless index.
> >>>>
> >>>> If it helps for further consideration, I am also happy to demonstrate the implementation I have in mind for the API via a PR.
> >>>>
> >>>> - Rahul
> >>>>
> >>>> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
> >>>>>
> >>>>> I worry that allowing us to reset the index created version on an existing index would only solve one part of the problem, while putting us on a path that makes it harder or impossible to solve the other part of the problem.
> >>>>>
> >>>>> If Lucene had had such an API from the beginning, we would still need to deal with legacy stuff, such as the old way that norms were encoded in the index (https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145), or trie fields (https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java). There currently is no way of upgrading norms or schemas in-place, and supporting this sounds quite scary while I can't think of major downsides of upgrading to a separate Directory and then atomically swapping pointers, especially as Solr/Elasticsearch already have logic for replicating operations to another copy of the data.
> >>>>>
> >>>>> This tells me that the upgraded index should not share state with the previous index. And the upgrade process should take care of reindexing to a new Directory while:
> >>>>> - bumping the index creation version,
> >>>>> - modernizing the schema if necessary (e.g. mapping "int" fields from trie fields in the previous index to point fields in the new index, or even sparse indexed fields in Lucene 10+),
> >>>>> - upgrading norms and analysis chains if necessary.
> >>>>>
> >>>>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com> wrote:
> >>>>>>
> >>>>>> Got it, thanks for providing additional context on the use case!
> >>>>>>
> >>>>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> There is an effort underway in Apache Solr where we want to provide a path to a legitimate upgrade without needing to reindex from source:
> >>>>>>> https://issues.apache.org/jira/browse/SOLR-17725
> >>>>>>>
> >>>>>>> Essentially the proposal is to read documents from segments where minVersion < current version and reindex them. At the same time, while the process is underway, have a custom merge policy which would exclude such segments from merging with latest version segments to prevent pollution.
> >>>>>>>
> >>>>>>> The result is an index which only contains segments with minVersion and version stamps the same as the current Lucene version (essentially case #2 that we discussed). This index would in all respects be an "upgraded" index, but would need "indexCreatedVersionMajor" to be reset as well. This is where the Lucene API (to reset "indexCreatedVersionMajor") becomes essential.
> >>>>>>>
> >>>>>>> I believe this is a pattern which can also be adopted by other Lucene based search engines like OpenSearch and Elasticsearch, and hence having this API could potentially benefit a large Lucene base.
> >>>>>>>
> >>>>>>> -Rahul
> >>>>>>>
> >>>>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> > Consider the following sequence of events...
> >>>>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene 8.x. ==> Upgrade to 9.x ==> index few documents and commit ==> seg3 gets created with version 9.x, but merge doesn't kick in ==> documents in seg1 and seg2 get deleted followed by commit. ==> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
> >>>>>>>>
> >>>>>>>> Thanks for the explanation. I am wondering if this is something that you commonly encounter, seems like a bit of an edge case?
> >>>>>>>>
> >>>>>>>> Regarding scenario 1, deleting the entire index and recreating it is generally faster and less resource intensive instead of deleting all the documents. Most systems built on top of Lucene like Solr, OpenSearch, Elasticsearch expose delete API for collection/index, and users just delete and recreate the index. Probably, one of the reasons it hasn't come up much before. Will let other community members chime in on this.
> >>>>>>>>
> >>>>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> For complete clarity... "minVersion" for a SegmentInfo is the min of the minVersions of all segments involved in the merge which resulted in this segment. If it is a "pure" segment, then minVersion=version.
> >>>>>>>>>
> >>>>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <rahul196...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Ankit,
> >>>>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all segments during the merge process?"
> >>>>>>>>>> > That is correct
> >>>>>>>>>>
> >>>>>>>>>> I am wondering if there is any way to end up in the 2nd scenario, without having deleted all the documents first?
> >>>>>>>>>> > Consider the following sequence of events...
> >>>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene 8.x. ==> Upgrade to 9.x ==> index few documents and commit ==> seg3 gets created with version 9.x, but merge doesn't kick in ==> documents in seg1 and seg2 get deleted followed by commit. ==> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
> >>>>>>>>>>
> >>>>>>>>>> -Rahul
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Rahul,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for starting this interesting discussion.
> >>>>>>>>>>> I was initially thinking that this API potentially allows upgrading "indexCreatedVersionMajor" via the merge process after rewriting all the segments, but I guess the SegmentInfo "minVersion" is the min across all segments during the merge process?
> >>>>>>>>>>>
> >>>>>>>>>>> So, I am wondering if there is any way to end up in the 2nd scenario, without having deleted all the documents first?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> Ankit
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <rahul196...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hello,
> >>>>>>>>>>>> Today even after all documents in an index are deleted via an API call, reindexing still doesn't change the "indexCreatedVersionMajor" property value in SegmentInfos. Hence even after complete reindexing, an upgrade path X --> X+1 --> X+2 is still not possible as we end up with an IndexFormatTooOldException.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this property (upon a new commit) to the current Lucene version if:
> >>>>>>>>>>>> 1) No more live docs present
> >>>>>>>>>>>> OR
> >>>>>>>>>>>> 2) If all SegmentInfo in the index have a "minVersion" AND "version" stamp of the latest version, but SegmentInfos has an older "indexCreatedVersionMajor".
> >>>>>>>>>>>>
> >>>>>>>>>>>> This will help users a LOT since they can now interact with the index purely via API without needing manual deletion and also help open up a legitimate path to upgrade when an index doesn't HAVE to be repopulated from the source.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Rahul Goswami
> >>>>>
> >>>>> --
> >>>>> Adrien
> >>
> >> --
> >> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
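
For readers following the thread, the two reset conditions from the first message (no live docs, or every segment carrying a "minVersion" and "version" stamp of the current version) and the "minVersion" merge rule can be sketched roughly as below. This is only an illustration under stated assumptions and does not use Lucene's API: `SegmentStamp`, `mergedMinVersionMajor` and `canResetCreatedVersion` are hypothetical names made up for this example, and versions are reduced to their major number.

```java
import java.util.List;

public class CreatedVersionReset {

    /** Hypothetical stand-in for the per-segment stamps discussed in the thread (not a Lucene class). */
    static final class SegmentStamp {
        final int minVersionMajor; // oldest major version that contributed to this segment
        final int versionMajor;    // major version that wrote this segment
        final int liveDocs;        // number of non-deleted documents

        SegmentStamp(int minVersionMajor, int versionMajor, int liveDocs) {
            this.minVersionMajor = minVersionMajor;
            this.versionMajor = versionMajor;
            this.liveDocs = liveDocs;
        }
    }

    /** "minVersion" of a merged segment: the minimum of the minVersions of its inputs. */
    static int mergedMinVersionMajor(List<SegmentStamp> inputs) {
        return inputs.stream()
                .mapToInt(s -> s.minVersionMajor)
                .min()
                .orElseThrow(() -> new IllegalArgumentException("merge needs at least one segment"));
    }

    /**
     * The proposed precondition for resetting the created version:
     * 1) no live docs remain, OR
     * 2) every segment has minVersion AND version equal to the current major.
     */
    static boolean canResetCreatedVersion(List<SegmentStamp> segments, int currentMajor) {
        boolean noLiveDocs = segments.stream().allMatch(s -> s.liveDocs == 0);
        boolean allCurrent = segments.stream()
                .allMatch(s -> s.minVersionMajor == currentMajor && s.versionMajor == currentMajor);
        return noLiveDocs || allCurrent;
    }

    public static void main(String[] args) {
        // The seg1/seg2/seg3 sequence from the thread, viewed from Lucene 9.x.
        SegmentStamp seg1 = new SegmentStamp(8, 8, 10);
        SegmentStamp seg3 = new SegmentStamp(9, 9, 25);
        // While an 8.x segment with live docs remains, the check fails.
        System.out.println(canResetCreatedVersion(List.of(seg1, seg3), 9));
        // Once only seg3 is left, the check passes.
        System.out.println(canResetCreatedVersion(List.of(seg3), 9));
    }
}
```

Note that a segment produced by merging an 8.x segment with a 9.x one would keep a minVersion of 8 under this rule, which is why the Solr proposal keeps old segments out of merges with current-version segments.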