I'm strongly opposed to allowing code to change or reset this value. Lucene needs to be able to defend itself against bogus bug reports about this, and to ensure that back compat really works.
I instead support Mark Miller's proposal to only increase the minimum-version when necessary on the Lucene side: https://github.com/apache/lucene/issues/13797

On Fri, Apr 25, 2025 at 1:24 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>
> Adrien,
> Appreciate your thoughts on this. I agree that the option of reindexing into
> a separate Directory is less "surgical" and definitely something to consider
> when the scale is manageable. Solr already has an API which can do this for
> SolrCloud. However, in my view, for cases where all the source fields are
> either stored or docValues true, the idea to support in-place upgrade merits
> a consideration due to the following reasons:
>
> 1) A lot of users may not have the budget or bandwidth to allocate additional
> resources to reindex into a parallel Directory. Especially in cases where
> there are thousands of indexes across hundreds of nodes, this process can
> quickly become cumbersome. This comes from a personal experience which was a
> major driver for me building this solution for my employer.
>
> 2) Search engines like Solr and Elasticsearch are often a piece in a bigger
> commercial software offering. In case of deployments which are in customer
> environments and not completely in control of the vendor, this proposition of
> having to completely reindex the data on to a parallel hardware can become a
> hard sell.
>
> 3) For install bases using Solr, Elasticsearch or OpenSearch, removing the
> overhead of index upgrade really helps them to look at the search engine as
> one homogenous piece of software that needs upgrading(when the time comes)
> rather than having to account for the index separately.
>
> I understand that in cases where the data type for an existing field changes
> (eg: Trie vs Point field as you mentioned), reindexing into a fresh Directory
> is the only option.
> However in my experience, since 8.x I have rarely seen
> this happen, and for such users I do see a way they can achieve significant
> savings in terms of cost and effort by means of in-place upgrading.
>
> Reindexing in-place can provide a solid alternative to the now retired Lucene
> IndexUpgrader Tool as a means to effectively achieve a lossless upgrade. It
> is the search engine's responsibility to build the guardrails around the
> upgrade process(eg: ensuring that all source fields are either stored or
> docValues true, or (maybe) disable updates via external API calls while such
> a reindexing is in progress etc). But I do believe the effort is worthy of
> consideration.
>
> Thanks,
> Rahul
>
>
>
> On Thu, Apr 24, 2025 at 3:06 AM Adrien Grand <jpou...@gmail.com> wrote:
>>
>> Hi Rahul,
>>
>> I'm still concerned about going with in-place upgrading instead of
>> reindexing in a separate directory. I don't mind opening an issue for
>> discussion, but I'd like the option of reindexing into a separate directory
>> to be considered. I think that it has lots of merits by avoiding multiple
>> versions of the same analyzer to be used in the same index, allowing schemas
>> to be upgraded (e.g. legacy GeoPoint -> LatLonPoint), etc. The downsides
>> that come to mind are that it puts a bit more work on the application
>> (rather than Lucene) and requires the upgrade to be done in a somewhat
>> timely manner to be practical (or replaying the delta since the time when
>> reindexing started may be heavy), but the trade-off still looks in favor of
>> reindexing to a separate Directory to me.
>>
>> On Thu, Apr 24, 2025 at 5:34 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>>>
>>> Hello,
>>> I wanted to circle back on this discussion before I go ahead with the
>>> creation of a Github issue. If there are any additional/unanswered concerns
>>> about the request, I am happy to elaborate further.
>>> Otherwise, I would like to go ahead and submit an issue and a PR for
>>> further consideration.
>>>
>>> Thanks,
>>> Rahul
>>>
>>> On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>>>>
>>>> Adrien,
>>>> Thanks for your thoughts on this. To clarify, the request is not for a
>>>> Lucene API which will upgrade the index, as I understand it might not
>>>> always be possible to do a lossless upgrade of a cold index.
>>>>
>>>> The request is for an API which when called, will check the version of
>>>> each segment in the SegmentInfos, and if each of them *already* has the
>>>> latest version, change the index created version in SegmentInfos to the
>>>> latest. Some API on IndexWriter, say commitAndUpdateCreatedVersion().
>>>>
>>>> None of the checks and balances or version compatibility restrictions from
>>>> Lucene's side have to change. We (Solr) only aim to support this kind of
>>>> reindexing between version X-1 and X. So as long as the schema adheres to
>>>> certain conditions, you'd be able to go from X-1 to X to X+1 without
>>>> needing to reindex from source, or provision infrastructure/effort for
>>>> reindexing to a parallel copy, and still have a completely lossless index.
>>>>
>>>> If it helps for further consideration, I am also happy to demonstrate the
>>>> implementation I have in mind for the API via a PR.
>>>>
>>>> - Rahul
>>>>
>>>> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
>>>>>
>>>>> I worry that allowing us to reset the index created version on an
>>>>> existing index would only solve one part of the problem, while putting us
>>>>> on a path that makes it harder or impossible to solve the other part of
>>>>> the problem.
>>>>>
>>>>> If Lucene had had such an API from the beginning, we would still need to
>>>>> deal with legacy stuff, such as the old way that norms were encoded in
>>>>> the index
>>>>> (https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
>>>>> or trie fields
>>>>> (https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
>>>>> There currently is no way of upgrading norms or schemas in-place, and
>>>>> supporting this sounds quite scary while I can't think of major downsides
>>>>> of upgrading to a separate Directory and then atomically swapping
>>>>> pointers, especially as Solr/Elasticsearch already have logic for
>>>>> replicating operations to another copy of the data.
>>>>>
>>>>> This tells me that the upgraded index should not share state with the
>>>>> previous index. And the upgrade process should take care of reindexing to
>>>>> a new Directory while:
>>>>> - bumping the index creation version,
>>>>> - modernizing the schema if necessary (e.g. mapping "int" fields from
>>>>> trie fields in the previous index to point fields in the new index, or
>>>>> even sparse indexed fields in Lucene 10+),
>>>>> - upgrading norms and analysis chains if necessary.
>>>>>
>>>>>
>>>>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com> wrote:
>>>>>>
>>>>>> Got it, thanks for providing additional context on the use case!
>>>>>>
>>>>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> There is an effort underway in Apache Solr where we want to provide a
>>>>>>> path to a legitimate upgrade without needing to reindex from source:
>>>>>>> https://issues.apache.org/jira/browse/SOLR-17725
>>>>>>>
>>>>>>> Essentially the proposal is to read documents from segments where
>>>>>>> minVersion < current version and reindex them.
>>>>>>> At the same time, while
>>>>>>> the process is underway, have a custom merge policy which would
>>>>>>> exclude such segments from merging with latest version segments to
>>>>>>> prevent pollution.
>>>>>>>
>>>>>>> Result is an index which only contains segments with minVersion and
>>>>>>> version stamps the same as the current Lucene version (essentially case
>>>>>>> #2 that we discussed). This index would in all respects be an
>>>>>>> "upgraded" index, but would need "indexCreatedVersionMajor" to be reset
>>>>>>> as well. This is where the Lucene API (to reset
>>>>>>> "indexCreatedVersionMajor") becomes essential.
>>>>>>>
>>>>>>> I believe this is a pattern which can also be adopted by other Lucene
>>>>>>> based search engines like Opensearch and Elasticsearch, and hence
>>>>>>> having this API could potentially benefit a large Lucene base.
>>>>>>>
>>>>>>> -Rahul
>>>>>>>
>>>>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> > Consider the following sequence of events...
>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>>>>>>>> 8.x. ==> Upgrade to 9.x ==> index few documents and commit ==> seg3
>>>>>>>> gets created with version 9.x, but merge doesn't kick in ==> documents
>>>>>>>> in seg1 and seg2 get deleted followed by commit.==> You are left with
>>>>>>>> seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene
>>>>>>>> 10.x fails.
>>>>>>>>
>>>>>>>> Thanks for the explanation. I am wondering if this is something that
>>>>>>>> you commonly encounter, seems like a bit of an edge case?
>>>>>>>>
>>>>>>>> Regarding scenario 1, deleting the entire index and recreating it is
>>>>>>>> generally faster and less resource intensive instead of deleting all
>>>>>>>> the documents. Most systems built on top of Lucene like Solr,
>>>>>>>> OpenSearch, Elasticsearch expose delete API for collection/index, and
>>>>>>>> users just delete and recreate the index.
>>>>>>>> Probably, one of the reasons
>>>>>>>> it hasn't come up much before. Will let other community members chime
>>>>>>>> in on this.
>>>>>>>>
>>>>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> For complete clarity..."minVersion" for a SegmentInfo is the min of
>>>>>>>>> the minVersions of all segments involved in the merge which resulted
>>>>>>>>> in this segment. If it is a "pure" segment, then minVersion=version.
>>>>>>>>>
>>>>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami
>>>>>>>>> <rahul196...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Ankit,
>>>>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all segments
>>>>>>>>>> during the merge process?"
>>>>>>>>>> > That is correct
>>>>>>>>>>
>>>>>>>>>> I am wondering if there is any way to end up in the 2nd scenario,
>>>>>>>>>> without having deleted all the documents first?
>>>>>>>>>> > Consider the following sequence of events...
>>>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>>>>>>>>>> Lucene 8.x. ==> Upgrade to 9.x ==> index few documents and commit
>>>>>>>>>> ==> seg3 gets created with version 9.x, but merge doesn't kick in
>>>>>>>>>> ==> documents in seg1 and seg2 get deleted followed by commit.==>
>>>>>>>>>> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x
>>>>>>>>>> ==> Upgrade to Lucene 10.x fails.
>>>>>>>>>>
>>>>>>>>>> -Rahul
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Rahul,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for starting this interesting discussion. I was initially
>>>>>>>>>>> thinking that this API potentially allows upgrading
>>>>>>>>>>> "indexCreatedVersionMajor" via the merge process after rewriting
>>>>>>>>>>> all the segments, but I guess the SegmentInfo "minVersion" is the
>>>>>>>>>>> min across all segments during the merge process?
>>>>>>>>>>>
>>>>>>>>>>> So, I am wondering if there is any way to end up in the 2nd
>>>>>>>>>>> scenario, without having deleted all the documents first?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Ankit
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami
>>>>>>>>>>> <rahul196...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> Today even after all documents in an index are deleted via an API
>>>>>>>>>>>> call, reindexing still doesn't change the
>>>>>>>>>>>> "indexCreatedVersionMajor" property value in SegmentInfos. Hence
>>>>>>>>>>>> even after complete reindexing, an upgrade path X--> X+1 --> X+2
>>>>>>>>>>>> is still not possible as we end up with an
>>>>>>>>>>>> IndexFormatTooOldException.
>>>>>>>>>>>>
>>>>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this property
>>>>>>>>>>>> (upon a new commit) to the current Lucene version if:
>>>>>>>>>>>> 1) No more live docs present
>>>>>>>>>>>> OR
>>>>>>>>>>>> 2) If all SegmentInfo in the index have a "minVersion" AND
>>>>>>>>>>>> "version" stamp of the latest version , but SegmentInfos has an
>>>>>>>>>>>> older "indexCreatedVersionMajor".
>>>>>>>>>>>>
>>>>>>>>>>>> This will help users a LOT since they can now interact with the
>>>>>>>>>>>> index purely via API without needing manual deletion and also help
>>>>>>>>>>>> open up a legitimate path to upgrade when an index doesn't HAVE to
>>>>>>>>>>>> be repopulated from the source.
>>>>>>>>>>>>
>>>>>>>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Rahul Goswami
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Adrien
>>
>>
>>
>> --
>> Adrien
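
For anyone following the thread, the minVersion rule and the two conditions in the original request can be modeled in a few lines of Java. This is only a sketch under stated assumptions: Seg, merge and eligibleForReset are hypothetical names, versions are reduced to major numbers, and none of this is Lucene's API or an agreed design (the reply at the top of the thread opposes adding such a reset).

```java
import java.util.List;

/** Sketch only: models the thread's terms with plain ints, not Lucene's own classes. */
final class CreatedVersionSketch {

    /** Hypothetical stand-in for a segment's "version" and "minVersion" stamps (major only). */
    static final class Seg {
        final int version;     // major version that wrote the segment
        final int minVersion;  // oldest major version that contributed to it

        Seg(int version, int minVersion) {
            this.version = version;
            this.minVersion = minVersion;
        }
    }

    /** A merged segment is written by `current`; its minVersion is the min over its inputs. */
    static Seg merge(int current, List<Seg> inputs) {
        int min = current;
        for (Seg s : inputs) {
            min = Math.min(min, s.minVersion);
        }
        return new Seg(current, min);
    }

    /** The condition from the original request, as an assumption-laden model. */
    static boolean eligibleForReset(int createdMajor, int current, int liveDocs, List<Seg> segs) {
        if (createdMajor >= current) {
            return false; // created version is not older, nothing to reset
        }
        if (liveDocs == 0) {
            return true;  // condition 1: no live docs remain
        }
        for (Seg s : segs) {
            // condition 2: every segment must be "pure" at the current version
            if (s.version != current || s.minVersion != current) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Seg seg1 = new Seg(8, 8);
        Seg seg3 = new Seg(9, 9);
        // Scenario from the thread: only seg3 survives, index was created in 8.x.
        System.out.println(eligibleForReset(8, 9, 3, List.of(seg3)));   // true
        // If seg1 had been merged with seg3 instead, the result carries minVersion 8.
        Seg merged = merge(9, List.of(seg1, seg3));
        System.out.println(merged.minVersion);                          // 8
        System.out.println(eligibleForReset(8, 9, 3, List.of(merged))); // false
    }
}
```

The merged case shows why the Solr proposal needs a merge policy that keeps old and new segments apart: a single merge with an 8.x segment pulls minVersion back to 8, so condition 2 no longer holds.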