I'm strongly opposed to allowing code to change/reset this value.
Lucene needs to be able to defend itself from bogus bug reports for
this and ensure back compat really works.

I instead support Mark Miller's proposal to only increase the
minimum version when necessary on the Lucene side:
https://github.com/apache/lucene/issues/13797

On Fri, Apr 25, 2025 at 1:24 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>
> Adrien,
> Appreciate your thoughts on this. I agree that the option of reindexing into 
> a separate Directory is less "surgical" and definitely something to consider 
> when the scale is manageable. Solr already has an API which can do this for 
> SolrCloud. However, in my view, for cases where all the source fields are 
> either stored or docValues true, the idea to support in-place upgrade merits 
> a consideration due to the following reasons:
>
> 1) A lot of users may not have the budget or bandwidth to allocate additional 
> resources to reindex into a parallel Directory. Especially in cases where 
> there are thousands of indexes across hundreds of nodes, this process can 
> quickly become cumbersome. This comes from a personal experience which was a 
> major driver for me building this solution for my employer.
>
> 2) Search engines like Solr and Elasticsearch are often a piece in a bigger 
> commercial software offering. In case of deployments which are in customer 
> environments and not completely in control of the vendor, this proposition of 
> having to completely reindex the data on to a parallel hardware can become a 
> hard sell.
>
> 3) For install bases using Solr, Elasticsearch or OpenSearch, removing the 
> overhead of index upgrade really helps them to look at the search engine as 
> one homogeneous piece of software that needs upgrading (when the time comes) 
> rather than having to account for the index separately.
>
> I understand that in cases where the data type for an existing field changes 
> (eg: Trie vs Point field as you mentioned), reindexing into a fresh Directory 
> is the only option. However in my experience, since 8.x I have rarely seen 
> this happen, and for such users I do see a way they can achieve significant 
> savings in terms of cost and effort by means of in-place upgrading.
>
> Reindexing in-place can provide a solid alternative to the now retired Lucene 
> IndexUpgrader Tool as a means to effectively achieve a lossless upgrade. It 
> is the search engine's responsibility to build the guardrails around the 
> upgrade process (e.g., ensuring that all source fields are either stored or 
> docValues true, or (maybe) disabling updates via external API calls while such 
> a reindexing is in progress). But I do believe the effort is worthy of 
> consideration.
>
> Thanks,
> Rahul
>
>
>
> On Thu, Apr 24, 2025 at 3:06 AM Adrien Grand <jpou...@gmail.com> wrote:
>>
>> Hi Rahul,
>>
>> I'm still concerned about going with in-place upgrading instead of 
>> reindexing in a separate directory. I don't mind opening an issue for 
>> discussion, but I'd like the option of reindexing into a separate directory 
>> to be considered. I think that it has lots of merits by avoiding multiple 
>> versions of the same analyzer being used in the same index, allowing schemas 
>> to be upgraded (e.g. legacy GeoPoint -> LatLonPoint), etc. The downsides 
>> that come to mind are that it puts a bit more work on the application 
>> (rather than Lucene) and requires the upgrade to be done in a somewhat 
>> timely manner to be practical (or replaying the delta since the time when 
>> reindexing started may be heavy), but the trade-off still looks in favor of 
>> reindexing to a separate Directory to me.
>>
>> On Thu, Apr 24, 2025 at 5:34 AM Rahul Goswami <rahul196...@gmail.com> wrote:
>>>
>>> Hello,
>>> I wanted to circle back on this discussion before I go ahead with the 
>>> creation of a Github issue. If there are any additional/unanswered concerns 
>>> about the request, I am happy to elaborate further.
>>> Otherwise, I would like to go ahead and submit an issue and a PR for 
>>> further consideration.
>>>
>>> Thanks,
>>> Rahul
>>>
>>> On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>>>>
>>>> Adrien,
>>>> Thanks for your thoughts on this. To clarify, the request is not for a 
>>>> Lucene API which will upgrade the index, as I understand it might not 
>>>> always be possible to do a lossless upgrade of a cold index.
>>>>
>>>> The request is for an API which when called, will check the version of 
>>>> each segment in the SegmentInfos, and if each of them *already* has the 
>>>> latest version, change the index created version in SegmentInfos to the 
>>>> latest. Some API on IndexWriter, say commitAndUpdateCreatedVersion().
>>>>
>>>> None of the checks and balances or version compatibility restrictions from 
>>>> Lucene's side have to change. We (Solr) only aim to support this kind of 
>>>> reindexing between version X-1 and X. So as long as the schema adheres to 
>>>> certain conditions, you'd be able to go from X-1 to X to X+1 without 
>>>> needing to reindex from source, or provision infrastructure/effort for 
>>>> reindexing to a parallel copy, and still have a completely lossless index.
>>>>
>>>> If it helps for further consideration, I am also happy to demonstrate the 
>>>> implementation I have in mind for the API via a PR.
>>>>
>>>> - Rahul
>>>>
>>>> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
>>>>>
>>>>> I worry that allowing us to reset the index created version on an 
>>>>> existing index would only solve one part of the problem, while putting us 
>>>>> on a path that makes it harder or impossible to solve the other part of 
>>>>> the problem.
>>>>>
>>>>> If Lucene had had such an API from the beginning, we would still need to 
>>>>> deal with legacy stuff, such as the old way that norms were encoded in 
>>>>> the index 
>>>>> (https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
>>>>>  or trie fields 
>>>>> (https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
>>>>> There is currently no way of upgrading norms or schemas in-place, and 
>>>>> supporting this sounds quite scary. Meanwhile, I can't think of major 
>>>>> downsides to upgrading to a separate Directory and then atomically swapping 
>>>>> pointers, especially as Solr/Elasticsearch already have logic for 
>>>>> replicating operations to another copy of the data.
>>>>>
>>>>> This tells me that the upgraded index should not share state with the 
>>>>> previous index. And the upgrade process should take care of reindexing to 
>>>>> a new Directory while:
>>>>>  - bumping the index creation version,
>>>>>  - modernizing the schema if necessary (e.g. mapping "int" fields from 
>>>>> trie fields in the previous index to point fields in the new index, or 
>>>>> even sparse indexed fields in Lucene 10+),
>>>>>  - upgrading norms and analysis chains if necessary.
>>>>>
>>>>>
>>>>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com> wrote:
>>>>>>
>>>>>> Got it, thanks for providing additional context on the use case!
>>>>>>
>>>>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com> 
>>>>>> wrote:
>>>>>>>
>>>>>>> There is an effort underway in Apache Solr where we want to provide a 
>>>>>>> path to a legitimate upgrade without needing to reindex from source:
>>>>>>> https://issues.apache.org/jira/browse/SOLR-17725
>>>>>>>
>>>>>>> Essentially the proposal is to read documents from segments where 
>>>>>>> minVersion < current version and reindex them. At the same time, while 
>>>>>>> the process is underway,  have a custom merge policy which would 
>>>>>>> exclude such segments from merging with latest version segments to 
>>>>>>> prevent pollution.
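
As a rough illustration of the partitioning described above, here is a toy
model in plain Java. It is not Solr's or Lucene's merge-policy code: the Seg
class and both method names are made up for this sketch, and versions are
reduced to major numbers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: Seg and both helper methods are hypothetical, not Lucene or Solr APIs.
final class MergeIsolationSketch {

    static final class Seg {
        final String name;
        final int minVersion; // oldest major version that contributed to this segment

        Seg(String name, int minVersion) {
            this.name = name;
            this.minVersion = minVersion;
        }
    }

    /** Names of segments whose documents would be read back and reindexed. */
    static List<String> needsReindex(List<Seg> segments, int currentMajor) {
        List<String> out = new ArrayList<>();
        for (Seg s : segments) {
            if (s.minVersion < currentMajor) {
                out.add(s.name);
            }
        }
        return out;
    }

    /** Names of segments a merge may combine without pulling in older data. */
    static List<String> mergeCandidates(List<Seg> segments, int currentMajor) {
        List<String> out = new ArrayList<>();
        for (Seg s : segments) {
            if (s.minVersion >= currentMajor) {
                out.add(s.name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Seg> segments =
            List.of(new Seg("seg1", 8), new Seg("seg2", 8), new Seg("seg3", 9));
        System.out.println("reindex: " + needsReindex(segments, 9));
        System.out.println("merge:   " + mergeCandidates(segments, 9));
    }
}
```

Keeping the two groups disjoint is the point of the custom merge policy: once
an old segment is merged with a new one, the result inherits the older
minVersion.
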
>>>>>>>
>>>>>>> Result is an index which only contains segments with minVersion and 
>>>>>>> version stamps the same as the current Lucene version (essentially case 
>>>>>>> #2 that we discussed). This index would in all respects be an 
>>>>>>> "upgraded" index, but would need "indexCreatedVersionMajor" to be reset 
>>>>>>> as well. This is where the Lucene API (to reset 
>>>>>>> "indexCreatedVersionMajor") becomes essential.
>>>>>>>
>>>>>>> I believe this is a pattern which can also be adopted by other Lucene 
>>>>>>> based search engines like OpenSearch and Elasticsearch, and hence 
>>>>>>> having this API could potentially benefit a large Lucene base.
>>>>>>>
>>>>>>> -Rahul
>>>>>>>
>>>>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com> 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> > Consider the following sequence of events...
>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene 
>>>>>>>> 8.x ==> Upgrade to 9.x ==> index a few documents and commit ==> seg3 
>>>>>>>> gets created with version 9.x, but merge doesn't kick in ==> documents 
>>>>>>>> in seg1 and seg2 get deleted, followed by a commit ==> You are left with 
>>>>>>>> seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 
>>>>>>>> 10.x fails.
>>>>>>>>
>>>>>>>> Thanks for the explanation. I am wondering if this is something that 
>>>>>>>> you commonly encounter, seems like a bit of an edge case?
>>>>>>>>
>>>>>>>> Regarding scenario 1, deleting the entire index and recreating it is 
>>>>>>>> generally faster and less resource intensive instead of deleting all 
>>>>>>>> the documents. Most systems built on top of Lucene like Solr, 
>>>>>>>> OpenSearch, Elasticsearch expose delete API for collection/index, and 
>>>>>>>> users just delete and recreate the index. Probably, one of the reasons 
>>>>>>>> it hasn't come up much before. Will let other community members chime 
>>>>>>>> in on this.
>>>>>>>>
>>>>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> For complete clarity..."minVersion" for a SegmentInfo is the min of 
>>>>>>>>> the minVersions of all segments involved in the merge which resulted 
>>>>>>>>> in this segment. If it is a "pure" segment, then minVersion=version.
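
The minVersion rule stated above can be sketched as a small toy model. This is
an illustration under assumptions, not Lucene's SegmentInfo implementation:
the Seg class, pure() and merge() are hypothetical names, and versions are
simplified to major numbers.

```java
import java.util.List;

// Toy model only: these classes are hypothetical and are not Lucene's SegmentInfo API.
final class MinVersionSketch {

    static final class Seg {
        final String name;
        final int version;    // major version that wrote this segment
        final int minVersion; // oldest major version that contributed documents

        Seg(String name, int version, int minVersion) {
            this.name = name;
            this.version = version;
            this.minVersion = minVersion;
        }

        /** A "pure" segment written directly by one version: minVersion == version. */
        static Seg pure(String name, int version) {
            return new Seg(name, version, version);
        }
    }

    /** A merge writes with the current version but keeps the oldest minVersion of its inputs. */
    static Seg merge(String name, int currentVersion, List<Seg> inputs) {
        int min = currentVersion;
        for (Seg s : inputs) {
            min = Math.min(min, s.minVersion);
        }
        return new Seg(name, currentVersion, min);
    }

    public static void main(String[] args) {
        Seg merged = merge("seg4", 9, List.of(Seg.pure("seg1", 8), Seg.pure("seg3", 9)));
        System.out.println("version=" + merged.version + " minVersion=" + merged.minVersion);
    }
}
```

In this model a merge can never raise minVersion, which is why rewriting
segments through merging alone does not make an old index look new.
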
>>>>>>>>>
>>>>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami 
>>>>>>>>> <rahul196...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Ankit,
>>>>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all segments 
>>>>>>>>>> during the merge process?"
>>>>>>>>>> > That is correct
>>>>>>>>>>
>>>>>>>>>> I am wondering if there is any way to end up in the 2nd scenario, 
>>>>>>>>>> without having deleted all the documents first?
>>>>>>>>>> > Consider the following sequence of events...
>>>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in 
>>>>>>>>>> Lucene 8.x ==> Upgrade to 9.x ==> index a few documents and commit 
>>>>>>>>>> ==> seg3 gets created with version 9.x, but merge doesn't kick in 
>>>>>>>>>> ==> documents in seg1 and seg2 get deleted, followed by a commit ==> 
>>>>>>>>>> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x 
>>>>>>>>>> ==> Upgrade to Lucene 10.x fails.
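
The sequence above can be replayed with a toy model. ToyIndex and its methods
are hypothetical names for this sketch, not Lucene classes. The sketch assumes
the rule discussed in this thread, namely that major version N refuses an
index whose created-version major is older than N-1, and it models dropping
fully deleted segments as a simple filter.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: ToyIndex is hypothetical and only mimics the behaviour described in the thread.
final class CreatedVersionSketch {

    static final class ToyIndex {
        final int createdMajor; // fixed when the index is first created
        final List<Integer> segmentMajors = new ArrayList<>(); // version of each segment

        ToyIndex(int createdMajor) {
            this.createdMajor = createdMajor;
        }

        void addSegment(int major) {
            segmentMajors.add(major);
        }

        /** Drops segments older than the given major (all their docs deleted); createdMajor is untouched. */
        void dropSegmentsOlderThan(int major) {
            segmentMajors.removeIf(v -> v < major);
        }

        /** Assumed rule: major version N accepts only indexes created with N-1 or later. */
        boolean canOpenWith(int major) {
            return createdMajor >= major - 1;
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(8); // created in 8.x
        index.addSegment(8);              // seg1
        index.addSegment(8);              // seg2
        index.addSegment(9);              // seg3, written after the upgrade to 9.x
        index.dropSegmentsOlderThan(9);   // seg1 and seg2 are fully deleted
        System.out.println("segments: " + index.segmentMajors);
        System.out.println("opens with 10.x: " + index.canOpenWith(10));
    }
}
```

Only 9.x data remains, yet the created-version stamp still says 8, so the
check for 10.x fails.
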
>>>>>>>>>>
>>>>>>>>>> -Rahul
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Rahul,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for starting this interesting discussion. I was initially 
>>>>>>>>>>> thinking that this API potentially allows upgrading 
>>>>>>>>>>> "indexCreatedVersionMajor" via the merge process after rewriting 
>>>>>>>>>>> all the segments, but I guess the SegmentInfo "minVersion" is the 
>>>>>>>>>>> min across all segments during the merge process?
>>>>>>>>>>>
>>>>>>>>>>> So, I am wondering if there is any way to end up in the 2nd 
>>>>>>>>>>> scenario, without having deleted all the documents first?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Ankit
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami 
>>>>>>>>>>> <rahul196...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> Today even after all documents in an index are deleted via an API 
>>>>>>>>>>>> call, reindexing still doesn't change the 
>>>>>>>>>>>> "indexCreatedVersionMajor" property value in SegmentInfos. Hence 
>>>>>>>>>>>> even after complete reindexing, an upgrade path X--> X+1 --> X+2 
>>>>>>>>>>>> is still not possible as we end up with an 
>>>>>>>>>>>> IndexFormatTooOldException.
>>>>>>>>>>>>
>>>>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this property 
>>>>>>>>>>>> (upon a new commit) to the current Lucene version if:
>>>>>>>>>>>> 1) No more live docs present
>>>>>>>>>>>> OR
>>>>>>>>>>>> 2) If all SegmentInfo in the index have a "minVersion" AND 
>>>>>>>>>>>> "version" stamp of the latest version, but SegmentInfos has an 
>>>>>>>>>>>> older "indexCreatedVersionMajor".
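
The two proposed conditions can be written down as a predicate. This is only a
sketch of the request, not an existing Lucene API or agreed behaviour: the Seg
class and mayReset() are hypothetical names, and versions are simplified to
major numbers.

```java
import java.util.List;

// Sketch of the proposed eligibility check; nothing here exists in Lucene.
final class ResetEligibilitySketch {

    static final class Seg {
        final int version;    // major version that wrote the segment
        final int minVersion; // oldest major version that contributed to it
        final int liveDocs;   // number of non-deleted documents

        Seg(int version, int minVersion, int liveDocs) {
            this.version = version;
            this.minVersion = minVersion;
            this.liveDocs = liveDocs;
        }
    }

    /** True if the created-version stamp could be reset under condition 1 or 2 of the proposal. */
    static boolean mayReset(int createdMajor, int latestMajor, List<Seg> segments) {
        if (createdMajor >= latestMajor) {
            return false; // stamp is already current, nothing to reset
        }
        // Condition 1: no live documents remain anywhere in the index.
        boolean noLiveDocs = segments.stream().allMatch(s -> s.liveDocs == 0);
        // Condition 2: every segment was written, and only ever written, by the latest version.
        boolean allLatest = segments.stream()
            .allMatch(s -> s.version == latestMajor && s.minVersion == latestMajor);
        return noLiveDocs || allLatest;
    }

    public static void main(String[] args) {
        List<Seg> upgraded = List.of(new Seg(9, 9, 100));
        List<Seg> merged = List.of(new Seg(9, 8, 100));
        System.out.println("pure 9.x segments: " + mayReset(8, 9, upgraded));
        System.out.println("merged from 8.x:   " + mayReset(8, 9, merged));
    }
}
```

Note that a segment merged from 8.x data fails condition 2 even though its
"version" is 9, because its minVersion is still 8.
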
>>>>>>>>>>>>
>>>>>>>>>>>> This will help users a LOT since they can now interact with the 
>>>>>>>>>>>> index purely via API without needing manual deletion and also help 
>>>>>>>>>>>> open up a legitimate path to upgrade when an index doesn't HAVE to 
>>>>>>>>>>>> be repopulated from the source.
>>>>>>>>>>>>
>>>>>>>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Rahul Goswami
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Adrien
>>
>>
>>
>> --
>> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
