Re: Requesting Lucene API to allow resetting index created version property

Rahul Goswami Thu, 24 Apr 2025 22:24:31 -0700

Adrien,
Appreciate your thoughts on this. I agree that the option of reindexing
into a separate Directory is less "surgical" and definitely something to
consider when the scale is manageable. Solr already has an API
<https://solr.apache.org/guide/8_1/collections-api.html#reindexcollection>
which can do this for SolrCloud. However, in my view, for cases where all
the source fields are either stored or docValues true, the idea of supporting
in-place upgrades merits consideration for the following reasons:


1) A lot of users may not have the budget or bandwidth to allocate
additional resources to reindex into a parallel Directory. Especially in
cases where there are thousands of indexes across hundreds of nodes, this
process can quickly become cumbersome. This comes from personal
experience, which was a major driver for my building this solution for
my employer.

2) Search engines like Solr and Elasticsearch are often one piece of a bigger
commercial software offering. For deployments which run in customer
environments and are not completely under the vendor's control, the
proposition of having to completely reindex the data onto parallel hardware
can become a hard sell.

3) For install bases using Solr, Elasticsearch or OpenSearch, removing the
overhead of index upgrade really helps them to look at the search engine as
one homogeneous piece of software that needs upgrading (when the time comes)
rather than having to account for the index separately.

I understand that in cases where the data type for an existing field
changes (e.g. Trie vs Point field as you mentioned), reindexing into a fresh
Directory is the only option. However, in my experience, I have rarely seen
this happen since 8.x, and for users whose field types are unchanged I do
see a way they can achieve significant savings in terms of cost and effort
by means of in-place upgrading.

Reindexing in-place can provide a solid alternative to the now retired
Lucene IndexUpgrader Tool as a means to effectively achieve a lossless
upgrade. It is the search engine's responsibility to build the guardrails
around the upgrade process (e.g. ensuring that all source fields are either
stored or docValues true, or (maybe) disabling updates via external API calls
while such a reindexing is in progress). But I do believe the effort is
worthy of consideration.
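
A minimal sketch of the first guardrail, assuming a simplified view of the
schema. The FieldSpec record and both method names are hypothetical and only
illustrate the check; they are not an existing Solr or Lucene API.

```java
import java.util.List;

/** Hypothetical guardrail sketch; not an existing Solr or Lucene API. */
final class InPlaceReindexGuardrail {

    /** Illustrative stand-in for one source field definition in the schema. */
    record FieldSpec(String name, boolean stored, boolean docValues) {}

    /** Names of source fields whose values could not be read back from the index. */
    static List<String> unrecoverableFields(List<FieldSpec> sourceFields) {
        return sourceFields.stream()
                .filter(f -> !f.stored() && !f.docValues())
                .map(FieldSpec::name)
                .toList();
    }

    /** In-place reindexing is only considered when every source field is recoverable. */
    static boolean eligible(List<FieldSpec> sourceFields) {
        return unrecoverableFields(sourceFields).isEmpty();
    }
}
```

The search engine would run a check like this before starting, and refuse the
in-place path (falling back to reindexing from source) if any field is listed.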

Thanks,
Rahul



On Thu, Apr 24, 2025 at 3:06 AM Adrien Grand <jpou...@gmail.com> wrote:

> Hi Rahul,
>
> I'm still concerned about going with in-place upgrading instead of
> reindexing in a separate directory. I don't mind opening an issue for
> discussion, but I'd like the option of reindexing into a separate directory
> to be considered. I think that it has lots of merits by avoiding multiple
> versions of the same analyzer being used in the same index, allowing
> schemas to be upgraded (e.g. legacy GeoPoint -> LatLonPoint), etc. The
> downsides that come to mind are that it puts a bit more work on the
> application (rather than Lucene) and requires the upgrade to be done in a
> somewhat timely manner to be practical (or replaying the delta since the
> time when reindexing started may be heavy), but the trade-off still looks
> in favor of reindexing to a separate Directory to me.
>
> On Thu, Apr 24, 2025 at 5:34 AM Rahul Goswami <rahul196...@gmail.com>
> wrote:
>
>> Hello,
>> I wanted to circle back on this discussion before I go ahead with the
>> creation of a GitHub issue. If there are any additional/unanswered concerns
>> about the request, I am happy to elaborate further.
>> Otherwise, I would like to go ahead and submit an issue and a PR for
>> further consideration.
>>
>> Thanks,
>> Rahul
>>
>> On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com>
>> wrote:
>>
>>> Adrien,
>>> Thanks for your thoughts on this. To clarify, the request is not for a
>>> Lucene API which will upgrade the index, as I understand it might not
>>> always be possible to do a lossless upgrade of a cold index.
>>>
>>> The request is for an API which, when called, will check the version of
>>> each segment in the SegmentInfos, and if each of them *already* has the
>>> latest version, change the index created version in SegmentInfos to the
>>> latest. Some API on IndexWriter, say commitAndUpdateCreatedVersion().
>>>
>>> None of the checks and balances or version compatibility restrictions
>>> from Lucene's side have to change. We (Solr) only aim to support this kind
>>> of reindexing between version X-1 and X. So as long as the schema adheres
>>> to certain conditions, you'd be able to go from X-1 to X to X+1 without
>>> needing to reindex from source, or provision infrastructure/effort for
>>> reindexing to a parallel copy, and still have a completely lossless index.
>>>
>>> If it helps for further consideration, I am also happy to
>>> demonstrate the implementation I have in mind for the API via a PR.
>>>
>>> - Rahul
>>>
>>> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
>>>
>>>> I worry that allowing us to reset the index created version on an
>>>> existing index would only solve one part of the problem, while putting us
>>>> on a path that makes it harder or impossible to solve the other part of the
>>>> problem.
>>>>
>>>> If Lucene had had such an API from the beginning, we would still need
>>>> to deal with legacy stuff, such as the old way that norms were encoded in
>>>> the index (
>>>> https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
>>>> or trie fields (
>>>> https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
>>>> There currently is no way of upgrading norms or schemas in-place, and
>>>> supporting this sounds quite scary while I can't think of major downsides
>>>> of upgrading to a separate Directory and then atomically swapping pointers,
>>>> especially as Solr/Elasticsearch already have logic for replicating
>>>> operations to another copy of the data.
>>>>
>>>> This tells me that the upgraded index should not share state with the
>>>> previous index. And the upgrade process should take care of reindexing to a
>>>> new Directory while:
>>>>  - bumping the index creation version,
>>>>  - modernizing the schema if necessary (e.g. mapping "int" fields from
>>>> trie fields in the previous index to point fields in the new index, or even
>>>> sparse indexed fields in Lucene 10+),
>>>>  - upgrading norms and analysis chains if necessary.
>>>>
>>>>
>>>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com>
>>>> wrote:
>>>>
>>>>> Got it, thanks for providing additional context on the use case!
>>>>>
>>>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> There is an effort underway in Apache Solr where we want to provide a
>>>>>> path to a legitimate upgrade without needing to reindex from source:
>>>>>> https://issues.apache.org/jira/browse/SOLR-17725
>>>>>>
>>>>>> Essentially the proposal is to read documents from segments where
>>>>>> minVersion < current version and reindex them. At the same time, while
>>>>>> the process is underway, have a custom merge policy which would exclude
>>>>>> such segments from merging with latest version segments to prevent
>>>>>> pollution.
>>>>>>
>>>>>> Result is an index which only contains segments with minVersion and
>>>>>> version stamps the same as the current Lucene version (essentially
>>>>>> case #2 that we discussed). This index would in all respects be an
>>>>>> "upgraded" index, but would need "indexCreatedVersionMajor" to be reset
>>>>>> as well. This is where the Lucene API (to reset
>>>>>> "indexCreatedVersionMajor") becomes essential.
>>>>>>
>>>>>> I believe this is a pattern which can also be adopted by other
>>>>>> Lucene-based search engines like OpenSearch and Elasticsearch, and hence
>>>>>> having this API could potentially benefit a large Lucene base.
>>>>>>
>>>>>> -Rahul
>>>>>>
>>>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > Consider the following sequence of events...
>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>>>>>>> Lucene 8.x ==> Upgrade to 9.x ==> index few documents and commit ==>
>>>>>>> seg3 gets created with version 9.x, but merge doesn't kick in ==>
>>>>>>> documents in seg1 and seg2 get deleted followed by commit ==> You are
>>>>>>> left with seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade
>>>>>>> to Lucene 10.x fails.
>>>>>>>
>>>>>>> Thanks for the explanation. I am wondering if this is something that
>>>>>>> you commonly encounter, seems like a bit of an edge case?
>>>>>>>
>>>>>>> Regarding scenario 1, deleting the entire index and recreating it is
>>>>>>> generally faster and less resource intensive than deleting all the
>>>>>>> documents. Most systems built on top of Lucene, like Solr, OpenSearch,
>>>>>>> and Elasticsearch, expose a delete API for collection/index, and users
>>>>>>> just delete and recreate the index. That is probably one of the reasons
>>>>>>> it hasn't come up much before. Will let other community members chime
>>>>>>> in on this.
>>>>>>>
>>>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> For complete clarity..."minVersion" for a SegmentInfo is the min of
>>>>>>>> the minVersions of all segments involved in the merge which resulted in
>>>>>>>> this segment. If it is a "pure" segment, then minVersion=version.
>>>>>>>>
>>>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <
>>>>>>>> rahul196...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Ankit,
>>>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all
>>>>>>>>> segments during the merge process?"
>>>>>>>>> > That is correct
>>>>>>>>>
>>>>>>>>> I am wondering if there is any way to end up in the 2nd scenario,
>>>>>>>>> without having deleted all the documents first?
>>>>>>>>> > Consider the following sequence of events...
>>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>>>>>>>>> Lucene 8.x ==> Upgrade to 9.x ==> index few documents and commit
>>>>>>>>> ==> seg3 gets created with version 9.x, but merge doesn't kick in
>>>>>>>>> ==> documents in seg1 and seg2 get deleted followed by commit ==>
>>>>>>>>> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x
>>>>>>>>> ==> Upgrade to Lucene 10.x fails.
>>>>>>>>>
>>>>>>>>> -Rahul
>>>>>>>>>
>>>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Rahul,
>>>>>>>>>>
>>>>>>>>>> Thanks for starting this interesting discussion. I was initially
>>>>>>>>>> thinking that this API potentially allows upgrading
>>>>>>>>>> "indexCreatedVersionMajor" via the merge process after rewriting
>>>>>>>>>> all the segments, but I guess the SegmentInfo "minVersion" is the
>>>>>>>>>> min across all segments during the merge process?
>>>>>>>>>>
>>>>>>>>>> So, I am wondering if there is any way to end up in the 2nd
>>>>>>>>>> scenario, without having deleted all the documents first?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Ankit
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <
>>>>>>>>>> rahul196...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>> Today even after all documents in an index are deleted via an
>>>>>>>>>>> API call, reindexing still doesn't change the
>>>>>>>>>>> "indexCreatedVersionMajor" property value in SegmentInfos. Hence
>>>>>>>>>>> even after complete reindexing, an upgrade path X --> X+1 --> X+2
>>>>>>>>>>> is still not possible as we end up with an
>>>>>>>>>>> IndexFormatTooOldException.
>>>>>>>>>>>
>>>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this
>>>>>>>>>>> property (upon a new commit) to the current Lucene version if:
>>>>>>>>>>> 1) No more live docs present
>>>>>>>>>>> OR
>>>>>>>>>>> 2) If all SegmentInfo in the index have a "minVersion" AND
>>>>>>>>>>> "version" stamp of the latest version, but SegmentInfos has an
>>>>>>>>>>> older "indexCreatedVersionMajor".
>>>>>>>>>>>
>>>>>>>>>>> This will help users a LOT since they can now interact with the
>>>>>>>>>>> index purely via API without needing manual deletion and also
>>>>>>>>>>> help open up a legitimate path to upgrade when an index doesn't
>>>>>>>>>>> HAVE to be repopulated from the source.
>>>>>>>>>>>
>>>>>>>>>>> If there is agreement, I am happy to pick this up and submit a
>>>>>>>>>>> PR.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Rahul Goswami
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>
>>>> --
>>>> Adrien
>>>>
>>>
>
> --
> Adrien
>
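
The conditions discussed in this thread (which segments would need to be
reindexed, and when "indexCreatedVersionMajor" could be reset) can be sketched
in plain Java. This is only a simplified model under stated assumptions:
versions are reduced to their major number, and the SegmentVersions record,
segmentsNeedingReindex and canReset are hypothetical names. None of this is an
existing Lucene API, and it is not the implementation proposed for the PR.

```java
import java.util.List;

/** Simplified model of the checks discussed in the thread; not a Lucene API. */
final class CreatedVersionResetCheck {

    /** Hypothetical stand-in for a segment's version stamps and live doc count. */
    record SegmentVersions(String name, int minVersionMajor, int versionMajor, int liveDocs) {}

    /** Segments that still carry data first written by an older major version. */
    static List<SegmentVersions> segmentsNeedingReindex(List<SegmentVersions> segments,
                                                        int latestMajor) {
        return segments.stream()
                .filter(s -> s.minVersionMajor() < latestMajor)
                .toList();
    }

    /**
     * True when the created version is older than the latest major and either
     * (1) no live docs remain, or (2) every segment has both "minVersion" and
     * "version" at the latest major.
     */
    static boolean canReset(List<SegmentVersions> segments,
                            int indexCreatedVersionMajor,
                            int latestMajor) {
        if (indexCreatedVersionMajor >= latestMajor) {
            return false; // nothing to reset
        }
        boolean noLiveDocs = segments.stream().allMatch(s -> s.liveDocs() == 0);
        boolean allLatest = segments.stream().allMatch(
                s -> s.minVersionMajor() == latestMajor && s.versionMajor() == latestMajor);
        return noLiveDocs || allLatest;
    }
}
```

For the sequence of events quoted above (only seg3 written by 9.x remains, but
the index was created in 8.x), canReset(List.of(seg3), 8, 9) returns true in
this model. A segment merged from 8.x and 9.x segments has minVersion 8, so it
is listed by segmentsNeedingReindex and blocks the reset while it has live docs.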
