I worry that allowing us to reset the index created version on an existing
index would only solve one part of the problem, while putting us on a path
that makes it harder or impossible to solve the other part of the problem.
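
To make the part that a reset would address concrete: as far as I
understand it, a Lucene release with major version N refuses to open an
index whose indexCreatedVersionMajor is below N-1, regardless of the
versions stamped on the individual segments. Below is a minimal Java model
of that gate and of how minVersion propagates through merges. The class and
method names are made up for illustration and are not Lucene APIs.

```java
import java.util.List;

/** Toy model of the created-version gate; NOT Lucene code, names are hypothetical. */
final class CreatedVersionGate {

    /** Stand-in for the version stamps of one segment (major versions only). */
    static final class Segment {
        final String name;
        final int minVersionMajor; // oldest version that contributed docs
        final int versionMajor;    // version that wrote the segment

        Segment(String name, int minVersionMajor, int versionMajor) {
            this.name = name;
            this.minVersionMajor = minVersionMajor;
            this.versionMajor = versionMajor;
        }
    }

    /** A merged segment inherits the smallest minVersion among its inputs. */
    static int mergedMinVersionMajor(List<Segment> inputs) {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("a merge needs at least one segment");
        }
        int min = Integer.MAX_VALUE;
        for (Segment s : inputs) {
            min = Math.min(min, s.minVersionMajor);
        }
        return min;
    }

    /** Assumed rule: release N opens indices created by N or N-1, nothing older. */
    static boolean canOpen(int indexCreatedVersionMajor, int runtimeMajor) {
        return indexCreatedVersionMajor >= runtimeMajor - 1
            && indexCreatedVersionMajor <= runtimeMajor;
    }
}
```

In this model, an index created on 8.x still opens on 9.x
(canOpen(8, 9) is true) but not on 10.x (canOpen(8, 10) is false), even if
every remaining segment was written by 9.x. Passing this gate says nothing
about norms or field encodings, which is the other part of the problem.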

If Lucene had had such an API from the beginning, we would still need to
deal with legacy stuff, such as the old way that norms were encoded in the
index (
https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
or trie fields (
https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
There is currently no way to upgrade norms or schemas in place, and
supporting this sounds quite scary. By contrast, I can't think of major
downsides to upgrading into a separate Directory and then atomically
swapping pointers, especially as Solr/Elasticsearch already have logic for
replicating operations to another copy of the data.
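
As a rough sketch of what "atomically swapping pointers" could look like at
the filesystem level, here is a small Java example that records the active
index location in a pointer file and replaces it with an atomic rename. The
pointer-file layout and the class and method names are assumptions for
illustration only. This is not how Lucene, Solr or Elasticsearch actually
track the live index.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: a tiny file names the directory readers should open. */
final class IndexPointerSwap {

    /** Points readers at newIndexDir by atomically replacing the pointer file. */
    static void publish(Path pointerFile, Path newIndexDir) throws IOException {
        // Write the new target next to the pointer so the rename stays on one filesystem.
        Path tmp = pointerFile.resolveSibling(pointerFile.getFileName() + ".tmp");
        Files.writeString(tmp, newIndexDir.toAbsolutePath().toString(), StandardCharsets.UTF_8);
        // With ATOMIC_MOVE, whether an existing target is replaced is implementation
        // specific; on POSIX filesystems this maps to rename(2), which replaces it.
        Files.move(tmp, pointerFile,
            StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Returns the directory that the pointer file currently names. */
    static Path current(Path pointerFile) throws IOException {
        return Paths.get(Files.readString(pointerFile, StandardCharsets.UTF_8).trim());
    }
}
```

The upgraded index would be fully written to its own directory first, and
publish() would only be called once it is complete, so a reader sees either
the old index or the new one and never a mix of the two.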

This tells me that the upgraded index should not share state with the
previous index. And the upgrade process should take care of reindexing to a
new Directory while:
 - bumping the index creation version,
 - modernizing the schema if necessary (e.g. mapping "int" fields from trie
fields in the previous index to point fields in the new index, or even
sparse indexed fields in Lucene 10+),
 - upgrading norms and analysis chains if necessary.


On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com> wrote:

> Got it, thanks for providing additional context on the use case!
>
> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com>
> wrote:
>
>> There is an effort underway in Apache Solr where we want to provide a
>> path to a legitimate upgrade without needing to reindex from source:
>> https://issues.apache.org/jira/browse/SOLR-17725
>>
>> Essentially the proposal is to read documents from segments where
>> minVersion < current version and reindex them. At the same time, while
>> the process is underway,  have a custom merge policy which would exclude
>> such segments from merging with latest version segments to prevent
>> pollution.
>>
>> Result is an index which only contains segments with minVersion and
>> version stamps the same as the current Lucene version (essentially case #2
>> that we discussed). This index would in all respects be an "upgraded"
>> index, but would need "indexCreatedVersionMajor" to be reset as well. This
>> is where the Lucene API (to reset "indexCreatedVersionMajor") becomes
>> essential.
>>
>> I believe this is a pattern which can also be adopted by other
>> Lucene-based search engines like OpenSearch and Elasticsearch, and hence
>> having this API could potentially benefit a large Lucene base.
>>
>> -Rahul
>>
>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com>
>> wrote:
>>
>>> > Consider the following sequence of events...
>>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>>> 8.x.  ==> Upgrade to 9.x ==> index few documents and commit ==> seg3 gets
>>> created with version 9.x, but merge doesn't kick in ==> documents in seg1
>>> and seg2 get deleted followed by commit.==> You are left with seg3 in 9.x
>>> but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>>
>>> Thanks for the explanation. I am wondering if this is something that you
>>> commonly encounter, seems like a bit of an edge case?
>>>
>>> Regarding scenario 1, deleting the entire index and recreating it is
>>> generally faster and less resource intensive instead of deleting all the
>>> documents. Most systems built on top of Lucene like Solr, OpenSearch,
>>> Elasticsearch expose delete API for collection/index, and users just delete
>>> and recreate the index. Probably, one of the reasons it hasn't come up much
>>> before. Will let other community members chime in on this.
>>>
>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com>
>>> wrote:
>>>
>>>> For complete clarity..."minVersion" for a SegmentInfo is the min of the
>>>> minVersions of all segments involved in the merge which resulted in this
>>>> segment. If it is a "pure" segment, then minVersion=version.
>>>>
>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <rahul196...@gmail.com>
>>>> wrote:
>>>>
>>>>> Ankit,
>>>>> "I guess the SegmentInfo "minVersion" is the min across all segments
>>>>> during the merge process?"
>>>>> > That is correct
>>>>>
>>>>> I am wondering if there is any way to end up in the 2nd scenario,
>>>>> without having deleted all the documents first?
>>>>> > Consider the following sequence of events...
>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>>>>> 8.x.  ==> Upgrade to 9.x ==> index few documents and commit ==> seg3 gets
>>>>> created with version 9.x, but merge doesn't kick in ==> documents in seg1
>>>>> and seg2 get deleted followed by commit.==> You are left with seg3 in 9.x
>>>>> but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>>>>
>>>>> -Rahul
>>>>>
>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Rahul,
>>>>>>
>>>>>> Thanks for starting this interesting discussion. I was initially
>>>>>> thinking that this API potentially allows upgrading
>>>>>> "indexCreatedVersionMajor" via the merge process after rewriting all the
>>>>>> segments, but I guess the SegmentInfo "minVersion" is the min across all
>>>>>> segments during the merge process?
>>>>>>
>>>>>> So, I am wondering if there is any way to end up in the 2nd scenario,
>>>>>> without having deleted all the documents first?
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Ankit
>>>>>>
>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <rahul196...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>> Today even after all documents in an index are deleted via an API
>>>>>>> call, reindexing still doesn't change the "indexCreatedVersionMajor"
>>>>>>> property value in SegmentInfos. Hence even after complete reindexing,
>>>>>>> an upgrade path X --> X+1 --> X+2 is still not possible as we end up
>>>>>>> with an IndexFormatTooOldException.
>>>>>>>
>>>>>>> Requesting an API (on IndexWriter?) which can reset this property
>>>>>>> (upon a new commit) to the current Lucene version if:
>>>>>>> 1) No more live docs present
>>>>>>> OR
>>>>>>> 2) If all SegmentInfo in the index have a "minVersion" AND "version"
>>>>>>> stamp of the latest version, but SegmentInfos has an older
>>>>>>> "indexCreatedVersionMajor".
>>>>>>>
>>>>>>> This will help users a LOT since they can now interact with the
>>>>>>> index purely via API without needing manual deletion and also help
>>>>>>> open up a legitimate path to upgrade when an index doesn't HAVE to be
>>>>>>> repopulated from the source.
>>>>>>>
>>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul Goswami
>>>>>>>
>>>>>>>
>>>>>>>

-- 
Adrien
