Hi Rahul,

I'm still concerned about going with in-place upgrading instead of reindexing into a separate directory. I don't mind opening an issue for discussion, but I'd like the option of reindexing into a separate directory to be considered. I think it has a lot of merit: it avoids having multiple versions of the same analyzer used in the same index, allows schemas to be upgraded (e.g. legacy GeoPoint -> LatLonPoint), etc. The downsides that come to mind are that it puts a bit more work on the application (rather than Lucene) and requires the upgrade to be done in a somewhat timely manner to be practical (otherwise replaying the delta accumulated since reindexing started may be heavy), but the trade-off still looks in favor of reindexing to a separate Directory to me.
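To make the "reindex elsewhere, then swap" idea concrete, here is a minimal filesystem-level sketch in Python. It is not Lucene API: `reindex_and_swap`, the `build_index` callback, and the `current` pointer file are all hypothetical names, and the callback stands in for whatever the application does to reindex and replay the delta.

```python
import os
import tempfile
from pathlib import Path


def reindex_and_swap(root: Path, new_name: str, build_index) -> Path:
    """Build a fresh index directory, then atomically repoint 'current' to it."""
    new_dir = root / new_name
    new_dir.mkdir()
    # Hypothetical callback: the application reindexes into new_dir and
    # replays any operations received since reindexing started.
    build_index(new_dir)

    # Write the new pointer to a temp file in the same directory, then rename
    # it over the old pointer; os.replace is atomic on POSIX filesystems.
    fd, tmp = tempfile.mkstemp(dir=root, prefix="current.")
    with os.fdopen(fd, "w") as f:
        f.write(new_name)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, root / "current")
    return new_dir


def current_index(root: Path) -> Path:
    """Resolve the directory that readers should open."""
    return root / (root / "current").read_text()
```

The old directory is left untouched until the pointer flips, so the previous and the upgraded index never share state, and a failed rebuild simply leaves `current` pointing at the old copy.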

On Thu, Apr 24, 2025 at 5:34 AM Rahul Goswami <rahul196...@gmail.com> wrote:

> Hello,
> I wanted to circle back on this discussion before I go ahead with the creation of a GitHub issue. If there are any additional/unanswered concerns about the request, I am happy to elaborate further.
> Otherwise, I would like to go ahead and submit an issue and a PR for further consideration.
>
> Thanks,
> Rahul
>
> On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>
>> Adrien,
>> Thanks for your thoughts on this. To clarify, the request is not for a Lucene API which will upgrade the index, as I understand it might not always be possible to do a lossless upgrade of a cold index.
>>
>> The request is for an API which when called, will check the version of each segment in the SegmentInfos, and if each of them *already* has the latest version, change the index created version in SegmentInfos to the latest. Some API on IndexWriter, say commitAndUpdateCreatedVersion().
>>
>> None of the checks and balances or version compatibility restrictions from Lucene's side have to change. We (Solr) only aim to support this kind of reindexing between version X-1 and X. So as long as the schema adheres to certain conditions, you'd be able to go from X-1 to X to X+1 without needing to reindex from source, or provision infrastructure/effort for reindexing to a parallel copy, and still have a completely lossless index.
>>
>> If it helps for further consideration, I am also happy to demonstrate the implementation I have in mind for the API via a PR.
>>
>> - Rahul
>>
>> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> I worry that allowing us to reset the index created version on an existing index would only solve one part of the problem, while putting us on a path that makes it harder or impossible to solve the other part of the problem.
>>>
>>> If Lucene had had such an API from the beginning, we would still need to deal with legacy stuff, such as the old way that norms were encoded in the index (https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145), or trie fields (https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java). There currently is no way of upgrading norms or schemas in-place, and supporting this sounds quite scary while I can't think of major downsides of upgrading to a separate Directory and then atomically swapping pointers, especially as Solr/Elasticsearch already have logic for replicating operations to another copy of the data.
>>>
>>> This tells me that the upgraded index should not share state with the previous index. And the upgrade process should take care of reindexing to a new Directory while:
>>> - bumping the index creation version,
>>> - modernizing the schema if necessary (e.g. mapping "int" fields from trie fields in the previous index to point fields in the new index, or even sparse indexed fields in Lucene 10+),
>>> - upgrading norms and analysis chains if necessary.
>>>
>>>
>>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com> wrote:
>>>
>>>> Got it, thanks for providing additional context on the use case!
>>>>
>>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>>>>
>>>>> There is an effort underway in Apache Solr where we want to provide a path to a legitimate upgrade without needing to reindex from source:
>>>>> https://issues.apache.org/jira/browse/SOLR-17725
>>>>>
>>>>> Essentially the proposal is to read documents from segments where minVersion < current version and reindex them.
>>>>> At the same time, while the process is underway, have a custom merge policy which would exclude such segments from merging with latest version segments to prevent pollution.
>>>>>
>>>>> Result is an index which only contains segments with minVersion and version stamps the same as the current Lucene version (essentially case #2 that we discussed). This index would in all respects be an "upgraded" index, but would need "indexCreatedVersionMajor" to be reset as well. This is where the Lucene API (to reset "indexCreatedVersionMajor") becomes essential.
>>>>>
>>>>> I believe this is a pattern which can also be adopted by other Lucene based search engines like OpenSearch and Elasticsearch, and hence having this API could potentially benefit a large Lucene base.
>>>>>
>>>>> -Rahul
>>>>>
>>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com> wrote:
>>>>>
>>>>>> > Consider the following sequence of events...
>>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene 8.x. ==> Upgrade to 9.x ==> index few documents and commit ==> seg3 gets created with version 9.x, but merge doesn't kick in ==> documents in seg1 and seg2 get deleted followed by commit ==> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>>>>>
>>>>>> Thanks for the explanation. I am wondering if this is something that you commonly encounter, seems like a bit of an edge case?
>>>>>>
>>>>>> Regarding scenario 1, deleting the entire index and recreating it is generally faster and less resource intensive instead of deleting all the documents. Most systems built on top of Lucene like Solr, OpenSearch, Elasticsearch expose delete API for collection/index, and users just delete and recreate the index. Probably, one of the reasons it hasn't come up much before.
>>>>>> Will let other community members chime in on this.
>>>>>>
>>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>>>>>>
>>>>>>> For complete clarity..."minVersion" for a SegmentInfo is the min of the minVersions of all segments involved in the merge which resulted in this segment. If it is a "pure" segment, then minVersion=version.
>>>>>>>
>>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ankit,
>>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all segments during the merge process?"
>>>>>>>> > That is correct
>>>>>>>>
>>>>>>>> I am wondering if there is any way to end up in the 2nd scenario, without having deleted all the documents first?
>>>>>>>> > Consider the following sequence of events...
>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene 8.x. ==> Upgrade to 9.x ==> index few documents and commit ==> seg3 gets created with version 9.x, but merge doesn't kick in ==> documents in seg1 and seg2 get deleted followed by commit ==> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>>>>>>>
>>>>>>>> -Rahul
>>>>>>>>
>>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Rahul,
>>>>>>>>>
>>>>>>>>> Thanks for starting this interesting discussion. I was initially thinking that this API potentially allows upgrading "indexCreatedVersionMajor" via the merge process after rewriting all the segments, but I guess the SegmentInfo "minVersion" is the min across all segments during the merge process?
>>>>>>>>>
>>>>>>>>> So, I am wondering if there is any way to end up in the 2nd scenario, without having deleted all the documents first?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Ankit
>>>>>>>>>
>>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> Today even after all documents in an index are deleted via an API call, reindexing still doesn't change the "indexCreatedVersionMajor" property value in SegmentInfos. Hence even after complete reindexing, an upgrade path X --> X+1 --> X+2 is still not possible as we end up with an IndexFormatTooOldException.
>>>>>>>>>>
>>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this property (upon a new commit) to the current Lucene version if:
>>>>>>>>>> 1) No more live docs present
>>>>>>>>>> OR
>>>>>>>>>> 2) If all SegmentInfo in the index have a "minVersion" AND "version" stamp of the latest version, but SegmentInfos has an older "indexCreatedVersionMajor".
>>>>>>>>>>
>>>>>>>>>> This will help users a LOT since they can now interact with the index purely via API without needing manual deletion and also help open up a legitimate path to upgrade when an index doesn't HAVE to be repopulated from the source.
>>>>>>>>>>
>>>>>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul Goswami
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>
>>> --
>>> Adrien
>>>
>>
--
Adrien
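
For reference, the two conditions in the original request and the seg1/seg2/seg3 sequence discussed in the thread can be modeled roughly as follows. This is a plain-Python toy, not Lucene code: `Segment`, its fields, and `can_reset_created_version` are hypothetical names chosen for illustration, and major versions are reduced to integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    version_major: int      # major version that wrote the segment
    min_version_major: int  # oldest major version among segments merged into it
    live_docs: int


def can_reset_created_version(segments, current_major: int) -> bool:
    """True if either condition from the request holds (toy model)."""
    # Condition 1: no live docs remain anywhere in the index.
    no_live_docs = all(s.live_docs == 0 for s in segments)
    # Condition 2: every segment carries the current major version in both
    # its "version" and its "minVersion" stamp.
    all_current = all(
        s.version_major == current_major and s.min_version_major == current_major
        for s in segments
    )
    return no_live_docs or all_current
```

In this model, an index holding only seg3 (written and stamped by 9.x) qualifies under condition 2, while a 9.x segment produced by merging 8.x segments does not, because its minVersion is still 8.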