There is an effort underway in Apache Solr where we want to provide a path
to a legitimate upgrade without needing to reindex from source:
https://issues.apache.org/jira/browse/SOLR-17725

Essentially, the proposal is to read documents from segments where
minVersion < current version and reindex them. While that process is
underway, a custom merge policy would exclude such segments from merging
with latest-version segments, so that the newly written segments do not
inherit an older minVersion.
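
To make the "pollution" concern concrete, here is a minimal sketch in plain
Java. It is only a model of the bookkeeping, not Lucene code: the Seg class,
the method names, and the simplification to major versions are all
assumptions made for illustration. The rule that old segments may still merge
with each other is also just one reading of the proposal.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the bookkeeping; none of these types are Lucene classes.
class SegmentIsolationSketch {

    /** Hypothetical stand-in for a segment's version stamps (major versions only). */
    static final class Seg {
        final String name;
        final int minVersionMajor; // oldest major version that contributed documents
        final int versionMajor;    // major version that wrote the segment

        Seg(String name, int minVersionMajor, int versionMajor) {
            this.name = name;
            this.minVersionMajor = minVersionMajor;
            this.versionMajor = versionMajor;
        }
    }

    /** A segment is "old" if anything in it predates the current major version. */
    static boolean needsReindex(Seg s, int currentMajor) {
        return s.minVersionMajor < currentMajor;
    }

    /** A merged segment inherits the smallest minVersion among its inputs. */
    static Seg merge(String name, int writerMajor, List<Seg> inputs) {
        int min = writerMajor;
        for (Seg s : inputs) {
            min = Math.min(min, s.minVersionMajor);
        }
        return new Seg(name, min, writerMajor);
    }

    /** Assumed isolation rule: never combine an old segment with a current one. */
    static boolean mayMerge(Seg a, Seg b, int currentMajor) {
        return needsReindex(a, currentMajor) == needsReindex(b, currentMajor);
    }

    public static void main(String[] args) {
        int current = 9;
        Seg seg1 = new Seg("seg1", 8, 8); // written before the upgrade
        Seg seg3 = new Seg("seg3", 9, 9); // written after the upgrade

        // Without isolation, the merged segment carries the old minVersion.
        Seg polluted = merge("seg4", current, Arrays.asList(seg1, seg3));
        System.out.println(polluted.minVersionMajor);      // 8
        System.out.println(mayMerge(seg1, seg3, current)); // false
    }
}
```

With the isolation rule in place, every segment written by the reindexing
pass keeps minVersion equal to the current version, and the old segments can
be dropped once their documents have been rewritten.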

The result is an index which only contains segments whose minVersion and
version stamps both match the current Lucene version (essentially case #2
that we discussed). This index would in all respects be an "upgraded" index,
but "indexCreatedVersionMajor" would need to be reset as well. This is where
the Lucene API (to reset "indexCreatedVersionMajor") becomes essential.
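
As a rough illustration of when such a reset would be allowed, the two
conditions from my original mail can be modeled as below. This is a sketch
in plain Java under stated assumptions: Seg, eligibleForReset and canOpen are
hypothetical names, not existing Lucene API, and versions are reduced to
their major number. canOpen only models the rule that an index created more
than one major version back is rejected.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the proposed check; none of these names exist in Lucene.
class CreatedVersionResetSketch {

    /** Hypothetical stand-in for one segment's stamps and live document count. */
    static final class Seg {
        final int minVersionMajor;
        final int versionMajor;
        final int liveDocs;

        Seg(int minVersionMajor, int versionMajor, int liveDocs) {
            this.minVersionMajor = minVersionMajor;
            this.versionMajor = versionMajor;
            this.liveDocs = liveDocs;
        }
    }

    /** Model of the open-time check: only the current and previous major are accepted. */
    static boolean canOpen(int indexCreatedVersionMajor, int luceneMajor) {
        return indexCreatedVersionMajor >= luceneMajor - 1
            && indexCreatedVersionMajor <= luceneMajor;
    }

    /** True if (1) no live docs remain OR (2) every segment is purely current. */
    static boolean eligibleForReset(List<Seg> segs, int indexCreatedVersionMajor,
                                    int currentMajor) {
        if (indexCreatedVersionMajor >= currentMajor) {
            return false; // already current, nothing to reset
        }
        int live = 0;
        boolean allCurrent = true;
        for (Seg s : segs) {
            live += s.liveDocs;
            if (s.minVersionMajor != currentMajor || s.versionMajor != currentMajor) {
                allCurrent = false;
            }
        }
        return live == 0 || allCurrent;
    }

    public static void main(String[] args) {
        // Index created on 8.x; only seg3 (written by 9.x) is left after deletes.
        List<Seg> segs = Arrays.asList(new Seg(9, 9, 5));
        System.out.println(eligibleForReset(segs, 8, 9)); // true
        System.out.println(canOpen(8, 10)); // false: the upgrade to 10.x fails today
        System.out.println(canOpen(9, 10)); // true: it would work after a reset to 9
    }
}
```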

I believe this is a pattern which can also be adopted by other Lucene-based
search engines like OpenSearch and Elasticsearch, and hence having this API
could potentially benefit a large part of the Lucene user base.

-Rahul

On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com> wrote:

> > Consider the following sequence of events...
> an index with 2 segments (seg1 and seg2) originally created in Lucene
> 8.x ==> Upgrade to 9.x ==> index a few documents and commit ==> seg3 gets
> created with version 9.x, but merge doesn't kick in ==> documents in seg1
> and seg2 get deleted, followed by a commit ==> You are left with seg3 in 9.x
> but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>
> Thanks for the explanation. I am wondering if this is something that you
> commonly encounter? It seems like a bit of an edge case.
>
> Regarding scenario 1, deleting the entire index and recreating it is
> generally faster and less resource intensive than deleting all the
> documents. Most systems built on top of Lucene, like Solr, OpenSearch and
> Elasticsearch, expose a delete API for a collection/index, and users just
> delete and recreate the index. That is probably one of the reasons it hasn't
> come up much before. Will let other community members chime in on this.
>
> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com>
> wrote:
>
>> For complete clarity..."minVersion" for a SegmentInfo is the min of the
>> minVersions of all segments involved in the merge which resulted in this
>> segment. If it is a "pure" segment, then minVersion=version.
>>
>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <rahul196...@gmail.com>
>> wrote:
>>
>>> Ankit,
>>> "I guess the SegmentInfo "minVersion" is the min across all segments
>>> during the merge process?"
>>> > That is correct
>>>
>>> I am wondering if there is any way to end up in the 2nd scenario,
>>> without having deleted all the documents first?
>>> > Consider the following sequence of events...
>>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>>> 8.x ==> Upgrade to 9.x ==> index a few documents and commit ==> seg3 gets
>>> created with version 9.x, but merge doesn't kick in ==> documents in seg1
>>> and seg2 get deleted, followed by a commit ==> You are left with seg3 in 9.x
>>> but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>>
>>> -Rahul
>>>
>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com>
>>> wrote:
>>>
>>>> Hi Rahul,
>>>>
>>>> Thanks for starting this interesting discussion. I was initially
>>>> thinking that this API potentially allows upgrading
>>>> "indexCreatedVersionMajor" via the merge process after rewriting all the
>>>> segments, but I guess the SegmentInfo "minVersion" is the min across all
>>>> segments during the merge process?
>>>>
>>>> So, I am wondering if there is any way to end up in the 2nd scenario,
>>>> without having deleted all the documents first?
>>>>
>>>>
>>>> Thanks
>>>> Ankit
>>>>
>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <rahul196...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> Today even after all documents in an index are deleted via an API
>>>>> call, reindexing still doesn't change the "indexCreatedVersionMajor"
>>>>> property value in SegmentInfos. Hence even after complete reindexing,
>>>>> an upgrade path X --> X+1 --> X+2 is still not possible as we end up
>>>>> with an IndexFormatTooOldException.
>>>>>
>>>>> Requesting an API (on IndexWriter?) which can reset this property
>>>>> (upon a new commit) to the current Lucene version if:
>>>>> 1) No more live docs present
>>>>> OR
>>>>> 2) All SegmentInfo instances in the index have a "minVersion" AND
>>>>> "version" stamp of the latest version, but SegmentInfos has an older
>>>>> "indexCreatedVersionMajor".
>>>>>
>>>>> This will help users a LOT since they can now interact with the index
>>>>> purely via API without needing manual deletion and also help open up a
>>>>> legitimate path to upgrade when an index doesn't HAVE to be repopulated
>>>>> from the source.
>>>>>
>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>
>>>>> Thanks,
>>>>> Rahul Goswami
>>>>>
>>>>>
>>>>>
