Hello,
I wanted to circle back on this discussion before I go ahead with the
creation of a GitHub issue. If there are any additional/unanswered concerns
about the request, I am happy to elaborate further.
Otherwise, I would like to go ahead and submit an issue and a PR for
further consideration.

Thanks,
Rahul

On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com> wrote:

> Adrien,
> Thanks for your thoughts on this. To clarify, the request is not for a
> Lucene API which will upgrade the index, as I understand it might not
> always be possible to do a lossless upgrade of a cold index.
>
> The request is for an API which, when called, will check the version of
> each segment in the SegmentInfos and, if each of them *already* has the
> latest version, change the index created version in SegmentInfos to the
> latest. This could be some API on IndexWriter, say
> commitAndUpdateCreatedVersion().
>
> None of the checks and balances or version compatibility restrictions from
> Lucene's side have to change. We (Solr) only aim to support this kind of
> reindexing between version X-1 and X. So as long as the schema adheres to
> certain conditions, you'd be able to go from X-1 to X to X+1 without
> needing to reindex from source, or provision infrastructure/effort for
> reindexing to a parallel copy, and still have a completely lossless index.
>
> If it helps for further consideration, I am also happy to demonstrate the
> implementation I have in mind for the API via a PR.
>
> - Rahul
>
> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
>
>> I worry that allowing us to reset the index created version on an
>> existing index would only solve one part of the problem, while putting us
>> on a path that makes it harder or impossible to solve the other part of the
>> problem.
>>
>> If Lucene had had such an API from the beginning, we would still need to
>> deal with legacy stuff, such as the old way that norms were encoded in the
>> index (
>> https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
>> or trie fields (
>> https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
>> There currently is no way of upgrading norms or schemas in-place, and
>> supporting this sounds quite scary. Meanwhile, I can't think of major
>> downsides of upgrading to a separate Directory and then atomically
>> swapping pointers, especially as Solr/Elasticsearch already have logic for
>> replicating operations to another copy of the data.
>>
>> This tells me that the upgraded index should not share state with the
>> previous index. And the upgrade process should take care of reindexing to a
>> new Directory while:
>>  - bumping the index creation version,
>>  - modernizing the schema if necessary (e.g. mapping "int" fields from
>> trie fields in the previous index to point fields in the new index, or even
>> sparse indexed fields in Lucene 10+),
>>  - upgrading norms and analysis chains if necessary.
>>
>>
>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com> wrote:
>>
>>> Got it, thanks for providing additional context on the use case!
>>>
>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com>
>>> wrote:
>>>
>>>> There is an effort underway in Apache Solr where we want to provide a
>>>> path to a legitimate upgrade without needing to reindex from source:
>>>> https://issues.apache.org/jira/browse/SOLR-17725
>>>>
>>>> Essentially the proposal is to read documents from segments where
>>>> minVersion < current version and reindex them. While the process is
>>>> underway, a custom merge policy would exclude such segments from merging
>>>> with latest-version segments to prevent pollution.
>>>>
>>>> The result is an index which only contains segments whose minVersion
>>>> and version stamps both match the current Lucene version (essentially
>>>> case #2 that we discussed). This index would in all respects be an
>>>> "upgraded" index, but would need "indexCreatedVersionMajor" to be reset
>>>> as well. This is where the Lucene API (to reset
>>>> "indexCreatedVersionMajor") becomes essential.
>>>>
>>>> I believe this is a pattern which can also be adopted by other
>>>> Lucene-based search engines like OpenSearch and Elasticsearch, and hence
>>>> having this API could potentially benefit a large Lucene user base.
>>>>
>>>> -Rahul
>>>>
>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com>
>>>> wrote:
>>>>
>>>>> > Consider the following sequence of events...
>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>>>>> 8.x ==> Upgrade to 9.x ==> index a few documents and commit ==> seg3
>>>>> gets created with version 9.x, but merge doesn't kick in ==> documents
>>>>> in seg1 and seg2 get deleted, followed by commit ==> You are left with
>>>>> seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene
>>>>> 10.x fails.
>>>>>
>>>>> Thanks for the explanation. I am wondering if this is something that
>>>>> you commonly encounter. It seems like a bit of an edge case.
>>>>>
>>>>> Regarding scenario 1, deleting the entire index and recreating it is
>>>>> generally faster and less resource intensive than deleting all the
>>>>> documents. Most systems built on top of Lucene, like Solr, OpenSearch,
>>>>> and Elasticsearch, expose a delete API for collection/index, and users
>>>>> just delete and recreate the index. That is probably one of the reasons
>>>>> it hasn't come up much before. Will let other community members chime in
>>>>> on this.
>>>>>
>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> For complete clarity: "minVersion" for a SegmentInfo is the min of
>>>>>> the minVersions of all segments involved in the merge which resulted in
>>>>>> this segment. If it is a "pure" segment, then minVersion=version.
>>>>>>
>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <rahul196...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Ankit,
>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all segments
>>>>>>> during the merge process?"
>>>>>>> > That is correct
>>>>>>>
>>>>>>> "I am wondering if there is any way to end up in the 2nd scenario,
>>>>>>> without having deleted all the documents first?"
>>>>>>> > Consider the following sequence of events...
>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>>>>>>> Lucene 8.x ==> Upgrade to 9.x ==> index a few documents and commit ==>
>>>>>>> seg3 gets created with version 9.x, but merge doesn't kick in ==>
>>>>>>> documents in seg1 and seg2 get deleted, followed by commit ==> You are
>>>>>>> left with seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade
>>>>>>> to Lucene 10.x fails.
>>>>>>>
>>>>>>> -Rahul
>>>>>>>
>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Rahul,
>>>>>>>>
>>>>>>>> Thanks for starting this interesting discussion. I was initially
>>>>>>>> thinking that this API potentially allows upgrading
>>>>>>>> "indexCreatedVersionMajor" via the merge process after rewriting all
>>>>>>>> the segments, but I guess the SegmentInfo "minVersion" is the min
>>>>>>>> across all segments during the merge process?
>>>>>>>>
>>>>>>>> So, I am wondering if there is any way to end up in the 2nd
>>>>>>>> scenario, without having deleted all the documents first?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ankit
>>>>>>>>
>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <
>>>>>>>> rahul196...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>> Today, even after all documents in an index are deleted via an API
>>>>>>>>> call, reindexing still doesn't change the "indexCreatedVersionMajor"
>>>>>>>>> property value in SegmentInfos. Hence, even after complete reindexing,
>>>>>>>>> an upgrade path X --> X+1 --> X+2 is still not possible, as we end up
>>>>>>>>> with an IndexFormatTooOldException.
>>>>>>>>>
>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this property
>>>>>>>>> (upon a new commit) to the current Lucene version if:
>>>>>>>>> 1) No live docs are present
>>>>>>>>> OR
>>>>>>>>> 2) All SegmentInfo instances in the index have a "minVersion" AND
>>>>>>>>> "version" stamp of the latest version, but SegmentInfos has an older
>>>>>>>>> "indexCreatedVersionMajor".
>>>>>>>>>
>>>>>>>>> This will help users a LOT, since they can now interact with the
>>>>>>>>> index purely via API without needing manual deletion. It will also
>>>>>>>>> help open up a legitimate path to upgrade when an index doesn't HAVE
>>>>>>>>> to be repopulated from the source.
>>>>>>>>>
>>>>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul Goswami
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>
>> --
>> Adrien
>>
>
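
To make the two reset conditions and the minVersion rule discussed in this
thread concrete, here is a minimal Python sketch of the eligibility check.
It is a toy model and not Lucene code: the Segment class, merge(), and
can_reset_created_version() are hypothetical names invented for
illustration, versions are reduced to a major number (real Lucene stamps
are full Version objects), and nothing here reflects how an eventual
IndexWriter API would actually be implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a SegmentInfo; only the fields the thread discusses."""
    name: str
    version: int       # major version of the writer that produced the segment
    min_version: int   # oldest major version that contributed to the segment
    live_docs: int     # number of non-deleted documents


def merge(name: str, inputs: List[Segment], writer_major: int) -> Segment:
    """Model of the rule from the thread: the merged segment carries the
    writer's version, and its minVersion is the min of the inputs' minVersions."""
    return Segment(
        name=name,
        version=writer_major,
        min_version=min(s.min_version for s in inputs),
        live_docs=sum(s.live_docs for s in inputs),
    )


def can_reset_created_version(segments: List[Segment], current_major: int) -> bool:
    """Return True if either condition from the original request holds."""
    # Condition 1: no live documents remain in the index.
    if sum(s.live_docs for s in segments) == 0:
        return True
    # Condition 2: every segment has both stamps at the current version.
    return all(
        s.version == current_major and s.min_version == current_major
        for s in segments
    )


if __name__ == "__main__":
    # The sequence of events from the thread, after upgrading 8.x -> 9.x.
    seg1 = Segment("seg1", version=8, min_version=8, live_docs=100)
    seg3 = Segment("seg3", version=9, min_version=9, live_docs=10)

    # seg1 and seg2 fully deleted and dropped: only the "pure" 9.x segment remains.
    print(can_reset_created_version([seg3], current_major=9))    # True

    # If a merge had mixed old and new segments, minVersion stays at 8.
    merged = merge("seg4", [seg1, seg3], writer_major=9)
    print(can_reset_created_version([merged], current_major=9))  # False
```

The second case is why the proposal pairs the check with a merge policy
that keeps older segments out of merges with latest-version segments: once
they are merged, the resulting segment keeps the older minVersion and the
index no longer qualifies under condition 2.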
