Hello, I wanted to circle back on this discussion before I go ahead and create a GitHub issue. If there are any additional or unanswered concerns about the request, I am happy to elaborate further. Otherwise, I would like to go ahead and submit an issue and a PR for further consideration.
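
To make the request in the quoted thread concrete, here is a minimal sketch of the eligibility check it describes. It is plain Python used only to illustrate the logic, not Lucene code and not a proposed implementation. The names SegmentStamp, merged_min_version and can_reset_created_version are hypothetical, and versions are reduced to major numbers as an assumption for brevity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SegmentStamp:
    """Hypothetical stand-in for one segment's version stamps (majors only)."""
    name: str
    version: int      # major version that wrote this segment
    min_version: int  # oldest major version that contributed via merges
    live_docs: int    # number of non-deleted documents


def merged_min_version(inputs: Sequence[SegmentStamp]) -> int:
    """A merged segment inherits the minimum minVersion of its inputs."""
    return min(s.min_version for s in inputs)


def can_reset_created_version(
    segments: Sequence[SegmentStamp], created_major: int, latest_major: int
) -> bool:
    """Return True when the created-version stamp could be moved to latest.

    Mirrors the two conditions from the original request:
    1) no live docs are present, OR
    2) every segment has both version and minVersion at the latest major.
    """
    if created_major >= latest_major:
        return False  # already current, nothing to reset
    if all(s.live_docs == 0 for s in segments):
        return True   # condition 1 (also covers an index with no segments)
    return all(
        s.version == latest_major and s.min_version == latest_major
        for s in segments
    )


# The seg1/seg2/seg3 sequence from the thread: only seg3 (written by 9.x)
# survives, while the index still carries a created version of 8.
seg1 = SegmentStamp("seg1", version=8, min_version=8, live_docs=3)
seg3 = SegmentStamp("seg3", version=9, min_version=9, live_docs=5)
eligible = can_reset_created_version([seg3], created_major=8, latest_major=9)
blocked = can_reset_created_version([seg1, seg3], created_major=8, latest_major=9)
```

In this sketch, merging seg1 into a 9.x segment would not help on its own, because the merged segment keeps a minVersion of 8 and so fails condition 2.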
Thanks,
Rahul

On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com> wrote:

> Adrien,
> Thanks for your thoughts on this. To clarify, the request is not for a
> Lucene API which will upgrade the index, as I understand it might not
> always be possible to do a lossless upgrade of a cold index.
>
> The request is for an API which when called, will check the version of
> each segment in the SegmentInfos, and if each of them *already* has the
> latest version, change the index created version in SegmentInfos to the
> latest. Some API on IndexWriter, say commitAndUpdateCreatedVersion().
>
> None of the checks and balances or version compatibility restrictions from
> Lucene's side have to change. We (Solr) only aim to support this kind of
> reindexing between version X-1 and X. So as long as the schema adheres to
> certain conditions, you'd be able to go from X-1 to X to X+1 without
> needing to reindex from source, or provision infrastructure/effort for
> reindexing to a parallel copy, and still have a completely lossless index.
>
> If it helps for further consideration, I am also happy to demonstrate the
> implementation I have in mind for the API via a PR.
>
> - Rahul
>
> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
>
>> I worry that allowing us to reset the index created version on an
>> existing index would only solve one part of the problem, while putting us
>> on a path that makes it harder or impossible to solve the other part of the
>> problem.
>>
>> If Lucene had had such an API from the beginning, we would still need to
>> deal with legacy stuff, such as the old way that norms were encoded in the
>> index (
>> https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
>> or trie fields (
>> https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
>> There currently is no way of upgrading norms or schemas in-place, and
>> supporting this sounds quite scary while I can't think of major downsides
>> of upgrading to a separate Directory and then atomically swapping pointers,
>> especially as Solr/Elasticsearch already have logic for replicating
>> operations to another copy of the data.
>>
>> This tells me that the upgraded index should not share state with the
>> previous index. And the upgrade process should take care of reindexing to a
>> new Directory while:
>> - bumping the index creation version,
>> - modernizing the schema if necessary (e.g. mapping "int" fields from
>>   trie fields in the previous index to point fields in the new index, or even
>>   sparse indexed fields in Lucene 10+),
>> - upgrading norms and analysis chains if necessary.
>>
>>
>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com> wrote:
>>
>>> Got it, thanks for providing additional context on the use case!
>>>
>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com>
>>> wrote:
>>>
>>>> There is an effort underway in Apache Solr where we want to provide a
>>>> path to a legitimate upgrade without needing to reindex from source:
>>>> https://issues.apache.org/jira/browse/SOLR-17725
>>>>
>>>> Essentially the proposal is to read documents from segments where
>>>> minVersion < current version and reindex them. At the same time, while
>>>> the process is underway, have a custom merge policy which would exclude
>>>> such segments from merging with latest version segments to prevent
>>>> pollution.
>>>>
>>>> Result is an index which only contains segments with minVersion and
>>>> version stamps the same as the current Lucene version (essentially case #2
>>>> that we discussed). This index would in all respects be an "upgraded"
>>>> index, but would need "indexCreatedVersionMajor" to be reset as well. This
>>>> is where the Lucene API (to reset "indexCreatedVersionMajor") becomes
>>>> essential.
>>>>
>>>> I believe this is a pattern which can also be adopted by other Lucene
>>>> based search engines like Opensearch and Elasticsearch, and hence having
>>>> this API could potentially benefit a large Lucene base.
>>>>
>>>> -Rahul
>>>>
>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com>
>>>> wrote:
>>>>
>>>>> > Consider the following sequence of events...
>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>>>>> 8.x. ==> Upgrade to 9.x ==> index few documents and commit ==> seg3 gets
>>>>> created with version 9.x, but merge doesn't kick in ==> documents in seg1
>>>>> and seg2 get deleted followed by commit.==> You are left with seg3 in 9.x
>>>>> but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>>>>
>>>>> Thanks for the explanation. I am wondering if this is something that
>>>>> you commonly encounter, seems like a bit of an edge case?
>>>>>
>>>>> Regarding scenario 1, deleting the entire index and recreating it is
>>>>> generally faster and less resource intensive instead of deleting all the
>>>>> documents. Most systems built on top of Lucene like Solr, OpenSearch,
>>>>> Elasticsearch expose delete API for collection/index, and users just
>>>>> delete and recreate the index. Probably, one of the reasons it hasn't
>>>>> come up much before. Will let other community members chime in on this.
>>>>>
>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> For complete clarity..."minVersion" for a SegmentInfo is the min of
>>>>>> the minVersions of all segments involved in the merge which resulted in
>>>>>> this segment. If it is a "pure" segment, then minVersion=version.
>>>>>>
>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <rahul196...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Ankit,
>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all segments
>>>>>>> during the merge process?"
>>>>>>> > That is correct
>>>>>>>
>>>>>>> I am wondering if there is any way to end up in the 2nd scenario,
>>>>>>> without having deleted all the documents first?
>>>>>>> > Consider the following sequence of events...
>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>>>>>>> Lucene 8.x. ==> Upgrade to 9.x ==> index few documents and commit ==>
>>>>>>> seg3 gets created with version 9.x, but merge doesn't kick in ==>
>>>>>>> documents in seg1 and seg2 get deleted followed by commit.==> You are
>>>>>>> left with seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade
>>>>>>> to Lucene 10.x fails.
>>>>>>>
>>>>>>> -Rahul
>>>>>>>
>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Rahul,
>>>>>>>>
>>>>>>>> Thanks for starting this interesting discussion. I was initially
>>>>>>>> thinking that this API potentially allows upgrading
>>>>>>>> "indexCreatedVersionMajor" via the merge process after rewriting all
>>>>>>>> the segments, but I guess the SegmentInfo "minVersion" is the min
>>>>>>>> across all segments during the merge process?
>>>>>>>>
>>>>>>>> So, I am wondering if there is any way to end up in the 2nd
>>>>>>>> scenario, without having deleted all the documents first?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ankit
>>>>>>>>
>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <
>>>>>>>> rahul196...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>> Today even after all documents in an index are deleted via an API
>>>>>>>>> call, reindexing still doesn't change the "indexCreatedVersionMajor"
>>>>>>>>> property value in SegmentInfos. Hence even after complete reindexing,
>>>>>>>>> an upgrade path X--> X+1 --> X+2 is still not possible as we end up
>>>>>>>>> with an IndexFormatTooOldException.
>>>>>>>>>
>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this property
>>>>>>>>> (upon a new commit) to the current Lucene version if:
>>>>>>>>> 1) No more live docs present
>>>>>>>>> OR
>>>>>>>>> 2) If all SegmentInfo in the index have a "minVersion" AND
>>>>>>>>> "version" stamp of the latest version, but SegmentInfos has an older
>>>>>>>>> "indexCreatedVersionMajor".
>>>>>>>>>
>>>>>>>>> This will help users a LOT since they can now interact with the
>>>>>>>>> index purely via API without needing manual deletion and also help
>>>>>>>>> open up a legitimate path to upgrade when an index doesn't HAVE to be
>>>>>>>>> repopulated from the source.
>>>>>>>>>
>>>>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul Goswami
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>
>> --
>> Adrien
>>
>