Hi Rahul,

I'm still concerned about going with in-place upgrading instead of
reindexing in a separate directory. I don't mind opening an issue for
discussion, but I'd like the option of reindexing into a separate directory
to be considered. I think it has a lot of merit: it avoids having multiple
versions of the same analyzer in use within one index, it allows schemas
to be upgraded (e.g. legacy GeoPoint -> LatLonPoint), etc. The downsides
that come to mind are that it puts a bit more work on the application
(rather than on Lucene) and that the upgrade needs to complete in a
somewhat timely manner to be practical (otherwise replaying the delta
accumulated since reindexing started may be heavy). Even so, the
trade-off still looks to me to be in favor of reindexing to a separate
Directory.
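
To make the shape of that flow concrete, here is a rough sketch of
"reindex into a separate copy, replay the delta, then swap". It is a toy
in-memory model only: ReindexSwapSketch, Index, Op and reindex() are
made-up names for illustration, not Lucene or Solr APIs, and a real
implementation would write to a second Directory instead of a Map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Toy in-memory model. None of these types are Lucene classes.
final class ReindexSwapSketch {

    // A write that arrived while reindexing was running; a null value means delete.
    static final class Op {
        final String id;
        final String value;
        Op(String id, String value) { this.id = id; this.value = value; }
    }

    // Hypothetical stand-in for an index that lives in its own directory.
    static final class Index {
        final int createdVersionMajor;
        final Map<String, String> docs;
        Index(int createdVersionMajor, Map<String, String> docs) {
            this.createdVersionMajor = createdVersionMajor;
            this.docs = docs;
        }
    }

    // Builds a fresh index stamped with targetMajor. The old index is only read.
    static Index reindex(Index old, int targetMajor,
                         UnaryOperator<String> schemaUpgrade, List<Op> delta) {
        Map<String, String> fresh = new LinkedHashMap<>();
        // 1. Copy the snapshot, modernizing each document on the way.
        for (Map.Entry<String, String> e : old.docs.entrySet()) {
            fresh.put(e.getKey(), schemaUpgrade.apply(e.getValue()));
        }
        // 2. Replay the writes received since the snapshot was taken.
        for (Op op : delta) {
            if (op.value == null) {
                fresh.remove(op.id);
            } else {
                fresh.put(op.id, schemaUpgrade.apply(op.value));
            }
        }
        return new Index(targetMajor, fresh);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "geopoint:48.85,2.35");
        docs.put("2", "geopoint:40.71,-74.00");
        // The pointer that readers follow to find the live index.
        AtomicReference<Index> live = new AtomicReference<>(new Index(9, docs));

        // Writes that arrived while the copy was in progress.
        List<Op> delta = new ArrayList<>();
        delta.add(new Op("2", null));
        delta.add(new Op("3", "geopoint:35.68,139.69"));

        Index upgraded = reindex(live.get(), 10,
                v -> v.replace("geopoint:", "latlon:"), delta);

        // 3. Swap the pointer: readers see the old or the new index, never a mix.
        live.set(upgraded);
        System.out.println(live.get().createdVersionMajor + " " + live.get().docs);
    }
}
```

The longer the copy in step 1 takes, the larger the delta replayed in
step 2 becomes, which is the timeliness concern mentioned above.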

On Thu, Apr 24, 2025 at 5:34 AM Rahul Goswami <rahul196...@gmail.com> wrote:

> Hello,
> I wanted to circle back on this discussion before I go ahead with the
> creation of a GitHub issue. If there are any additional/unanswered concerns
> about the request, I am happy to elaborate further.
> Otherwise, I would like to go ahead and submit an issue and a PR for
> further consideration.
>
> Thanks,
> Rahul
>
> On Sun, Apr 20, 2025 at 8:14 PM Rahul Goswami <rahul196...@gmail.com>
> wrote:
>
>> Adrien,
>> Thanks for your thoughts on this. To clarify, the request is not for a
>> Lucene API which will upgrade the index, as I understand it might not
>> always be possible to do a lossless upgrade of a cold index.
>>
>> The request is for an API which, when called, will check the version of
>> each segment in the SegmentInfos and, if each of them *already* has the
>> latest version, change the index created version in SegmentInfos to the
>> latest. This could be some API on IndexWriter, say
>> commitAndUpdateCreatedVersion().
>>
>> None of the checks and balances or version compatibility restrictions
>> from Lucene's side have to change. We (Solr) only aim to support this kind
>> of reindexing between version X-1 and X. So as long as the schema adheres
>> to certain conditions, you'd be able to go from X-1 to X to X+1 without
>> needing to reindex from source, or provision infrastructure/effort for
>> reindexing to a parallel copy, and still have a completely lossless index.
>>
>> If it helps for further consideration, I am also happy to demonstrate the
>> implementation I have in mind for the API via a PR.
>>
>> - Rahul
>>
>> On Sun, Apr 20, 2025 at 4:12 AM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> I worry that allowing us to reset the index created version on an
>>> existing index would only solve one part of the problem, while putting us
>>> on a path that makes it harder or impossible to solve the other part of the
>>> problem.
>>>
>>> If Lucene had had such an API from the beginning, we would still need to
>>> deal with legacy stuff, such as the old way that norms were encoded in the
>>> index (
>>> https://github.com/apache/lucene/blob/releases/lucene-solr/7.0.0/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L141-L145),
>>> or trie fields (
>>> https://github.com/apache/lucene/blob/releases/lucene-solr/5.0.0/lucene/core/src/java/org/apache/lucene/document/IntField.java).
>>> There currently is no way of upgrading norms or schemas in-place, and
>>> supporting this sounds quite scary while I can't think of major downsides
>>> of upgrading to a separate Directory and then atomically swapping pointers,
>>> especially as Solr/Elasticsearch already have logic for replicating
>>> operations to another copy of the data.
>>>
>>> This tells me that the upgraded index should not share state with the
>>> previous index. And the upgrade process should take care of reindexing to a
>>> new Directory while:
>>>  - bumping the index creation version,
>>>  - modernizing the schema if necessary (e.g. mapping "int" fields from
>>> trie fields in the previous index to point fields in the new index, or even
>>> sparse indexed fields in Lucene 10+),
>>>  - upgrading norms and analysis chains if necessary.
>>>
>>>
>>> On Sun, Apr 20, 2025 at 8:47 AM Ankit Jain <jain.ank...@gmail.com>
>>> wrote:
>>>
>>>> Got it, thanks for providing additional context on the use case!
>>>>
>>>> On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com>
>>>> wrote:
>>>>
>>>>> There is an effort underway in Apache Solr where we want to provide a
>>>>> path to a legitimate upgrade without needing to reindex from source:
>>>>> https://issues.apache.org/jira/browse/SOLR-17725
>>>>>
>>>>> Essentially the proposal is to read documents from segments where
>>>>> minVersion < current version and reindex them. While the process is
>>>>> underway, a custom merge policy would exclude such segments from
>>>>> merging with latest-version segments to prevent pollution.
>>>>>
>>>>> The result is an index which only contains segments whose minVersion
>>>>> and version stamps both equal the current Lucene version (essentially
>>>>> case #2 that we discussed). This index would in all respects be an
>>>>> "upgraded" index, but would need "indexCreatedVersionMajor" to be reset
>>>>> as well. This is where the Lucene API (to reset
>>>>> "indexCreatedVersionMajor") becomes essential.
>>>>>
>>>>> I believe this is a pattern which can also be adopted by other
>>>>> Lucene-based search engines like OpenSearch and Elasticsearch, and
>>>>> hence having this API could potentially benefit a large Lucene base.
>>>>>
>>>>> -Rahul
>>>>>
>>>>> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> > Consider the following sequence of events...
>>>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>>>>>> 8.x ==> Upgrade to 9.x ==> index a few documents and commit ==> seg3
>>>>>> gets created with version 9.x, but merge doesn't kick in ==> documents
>>>>>> in seg1 and seg2 get deleted, followed by commit ==> You are left with
>>>>>> seg3 in 9.x but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene
>>>>>> 10.x fails.
>>>>>>
>>>>>> Thanks for the explanation. I am wondering if this is something that
>>>>>> you commonly encounter, seems like a bit of an edge case?
>>>>>>
>>>>>> Regarding scenario 1, deleting the entire index and recreating it is
>>>>>> generally faster and less resource-intensive than deleting all the
>>>>>> documents. Most systems built on top of Lucene, like Solr, OpenSearch,
>>>>>> and Elasticsearch, expose a delete API for the collection/index, and
>>>>>> users just delete and recreate the index. That is probably one of the
>>>>>> reasons it hasn't come up much before. Will let other community
>>>>>> members chime in on this.
>>>>>>
>>>>>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> For complete clarity..."minVersion" for a SegmentInfo is the min of
>>>>>>> the minVersions of all segments involved in the merge which resulted in
>>>>>>> this segment. If it is a "pure" segment, then minVersion=version.
>>>>>>>
>>>>>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <
>>>>>>> rahul196...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ankit,
>>>>>>>> "I guess the SegmentInfo "minVersion" is the min across all
>>>>>>>> segments during the merge process?"
>>>>>>>> > That is correct
>>>>>>>>
>>>>>>>> I am wondering if there is any way to end up in the 2nd scenario,
>>>>>>>> without having deleted all the documents first?
>>>>>>>> > Consider the following sequence of events...
>>>>>>>> an index with 2 segments (seg1 and seg2) originally created in
>>>>>>>> Lucene 8.x ==> Upgrade to 9.x ==> index a few documents and commit
>>>>>>>> ==> seg3 gets created with version 9.x, but merge doesn't kick in
>>>>>>>> ==> documents in seg1 and seg2 get deleted, followed by commit ==>
>>>>>>>> You are left with seg3 in 9.x but indexCreatedVersionMajor as 8.x
>>>>>>>> ==> Upgrade to Lucene 10.x fails.
>>>>>>>>
>>>>>>>> -Rahul
>>>>>>>>
>>>>>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Rahul,
>>>>>>>>>
>>>>>>>>> Thanks for starting this interesting discussion. I was initially
>>>>>>>>> thinking that this API potentially allows upgrading
>>>>>>>>> "indexCreatedVersionMajor" via the merge process after rewriting all
>>>>>>>>> the segments, but I guess the SegmentInfo "minVersion" is the min
>>>>>>>>> across all segments during the merge process?
>>>>>>>>>
>>>>>>>>> So, I am wondering if there is any way to end up in the 2nd
>>>>>>>>> scenario, without having deleted all the documents first?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Ankit
>>>>>>>>>
>>>>>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <
>>>>>>>>> rahul196...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> Today even after all documents in an index are deleted via an API
>>>>>>>>>> call, reindexing still doesn't change the "indexCreatedVersionMajor"
>>>>>>>>>> property value in SegmentInfos. Hence even after complete reindexing,
>>>>>>>>>> an upgrade path X --> X+1 --> X+2 is still not possible as we end up
>>>>>>>>>> with an IndexFormatTooOldException.
>>>>>>>>>>
>>>>>>>>>> Requesting an API (on IndexWriter?) which can reset this property
>>>>>>>>>> (upon a new commit) to the current Lucene version if:
>>>>>>>>>> 1) No live docs remain in the index
>>>>>>>>>> OR
>>>>>>>>>> 2) All SegmentInfo instances in the index have a "minVersion" AND
>>>>>>>>>> "version" stamp of the latest version, but SegmentInfos has an older
>>>>>>>>>> "indexCreatedVersionMajor".
>>>>>>>>>>
>>>>>>>>>> This will help users a LOT since they can now interact with the
>>>>>>>>>> index purely via API without needing manual deletion and also help
>>>>>>>>>> open up a legitimate path to upgrade when an index doesn't HAVE to be
>>>>>>>>>> repopulated from the source.
>>>>>>>>>>
>>>>>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul Goswami
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>
>>> --
>>> Adrien
>>>
>>

-- 
Adrien
