[ 
https://issues.apache.org/jira/browse/LUCENE-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084682#comment-17084682
 ] 

Simon Willnauer commented on LUCENE-9324:
-----------------------------------------

I am trying to give a bit more context to this issue. Today we have 
_SegmentInfo_, which represents a segment once it's written to disk, for 
instance at flush or merge time. We have a randomly generated ID in 
_SegmentInfo_ that can be used to verify whether two segments are the same. 
Since we use incremental numbers for segment naming, it's likely that two 
IndexWriters produce a segment with very similar contents and the same name. 
Yet the _SegmentInfo_ ID would be different. We also have checksums on files, 
which can be used alongside the ID to verify identity but should not be 
treated as proof of identity by themselves, since they are very weak checksums. 
Now segments also get _updated_, for instance when a document is marked as 
deleted or the segment receives a doc values update. The only thing that 
changes is the delete or update generation, which allows two IndexWriters 
that opened two copies of a segment (with the same segment ID) to each produce 
a new delGen or dvGen that looks identical from the outside but is actually 
different. This is a problem that we see quite frequently in Elasticsearch, and 
we'd like to prevent it or at least have a better tool in our hands to 
distinguish _SegmentCommitInfo_ instances from one another. If we had an ID on 
_SegmentCommitInfo_ that changes each time one of these generations changes, we 
could tell much more easily whether only the updated files (which are often 
very small) need to be replaced in order to recover an index. 
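
As a rough illustration of the recovery decision described above, the 
following sketch compares the IDs of a source and a target copy of one segment. 
All names here (RecoveryPlanSketch, Action, plan) are hypothetical and are not 
Lucene or Elasticsearch APIs; IDs are assumed to be opaque byte arrays.

```java
import java.util.Arrays;

// Hypothetical sketch: none of these names are Lucene APIs.
final class RecoveryPlanSketch {

  enum Action { REUSE_ALL, REPLACE_GENERATIONAL_FILES, REPLACE_SEGMENT }

  /** Decide what to copy for one segment, given the IDs of the source and target copies. */
  static Action plan(byte[] srcSegmentId, byte[] srcCommitId,
                     byte[] dstSegmentId, byte[] dstCommitId) {
    if (!Arrays.equals(srcSegmentId, dstSegmentId)) {
      // A matching segment name is not enough: two writers can both produce
      // a segment with the same name, so differing IDs mean a different segment.
      return Action.REPLACE_SEGMENT;
    }
    if (!Arrays.equals(srcCommitId, dstCommitId)) {
      // Same segment, but deletes or doc values updates diverged:
      // only the (often very small) generational files need replacing.
      return Action.REPLACE_GENERATIONAL_FILES;
    }
    // Both IDs match: nothing needs to be copied for this segment.
    return Action.REUSE_ALL;
  }
}
```

Without a per-commit ID, the middle case cannot be told apart from the last 
one by looking at the generation numbers alone.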

The plan is to implement this in a very similar fashion to what we did for 
_SegmentInfo_, but also to invalidate the ID once any of the generations 
changes, in order to force a new _SegmentCommitInfo_ ID for the new generation. 
Yet the IDs would not be the same if two IndexWriters start from the same 
segment and make an identical change to it, i.e. it's not a replacement for a 
strong hash function.
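
A minimal sketch of that idea, under stated assumptions: CommitInfoSketch is a 
hypothetical stand-in and not the actual _SegmentCommitInfo_ class, the 16-byte 
ID length is chosen for illustration, and generations are simplified to 
counters starting at 0.

```java
import java.security.SecureRandom;

// Hypothetical sketch, not Lucene's SegmentCommitInfo.
final class CommitInfoSketch {
  private static final SecureRandom RANDOM = new SecureRandom();
  private static final int ID_LENGTH = 16; // assumed length, for illustration only

  private final byte[] segmentId; // fixed once the segment is written
  private byte[] commitId;        // replaced whenever a generation changes
  private long delGen;            // simplified: starts at 0 and counts up
  private long docValuesGen;      // simplified: starts at 0 and counts up

  CommitInfoSketch(byte[] segmentId) {
    this(segmentId.clone(), newId(), 0, 0);
  }

  private CommitInfoSketch(byte[] segmentId, byte[] commitId, long delGen, long docValuesGen) {
    this.segmentId = segmentId;
    this.commitId = commitId;
    this.delGen = delGen;
    this.docValuesGen = docValuesGen;
  }

  /** Models a second writer opening an identical copy of the same segment. */
  CommitInfoSketch copy() {
    return new CommitInfoSketch(segmentId.clone(), commitId.clone(), delGen, docValuesGen);
  }

  /** A delete was applied: bump the generation and invalidate the old ID. */
  void advanceDelGen() {
    delGen++;
    commitId = newId();
  }

  /** A doc values update was applied: bump the generation and invalidate the old ID. */
  void advanceDocValuesGen() {
    docValuesGen++;
    commitId = newId();
  }

  byte[] segmentId() { return segmentId.clone(); }
  byte[] commitId() { return commitId.clone(); }
  long delGen() { return delGen; }
  long docValuesGen() { return docValuesGen; }

  private static byte[] newId() {
    byte[] id = new byte[ID_LENGTH];
    RANDOM.nextBytes(id);
    return id;
  }
}
```

Two copies obtained via copy() that each call advanceDelGen() end up with the 
same delGen and the same segment ID but, because the IDs are random rather than 
derived from content, with different commit IDs. That is the behaviour 
described above, and also why such an ID is not a content hash.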

> Give IDs to SegmentCommitInfo
> -----------------------------
>
>                 Key: LUCENE-9324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9324
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Adrien Grand
>            Priority: Minor
>
> We already have IDs in SegmentInfo, which are useful to uniquely identify 
> segments. Having IDs on SegmentCommitInfo would be useful too in order to 
> compare commits for equality and make snapshots incremental on generational 
> files too.



