[ https://issues.apache.org/jira/browse/LUCENE-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084682#comment-17084682 ]
Simon Willnauer commented on LUCENE-9324:
-----------------------------------------

I am trying to give a bit more context to this issue. Today we have _SegmentInfo_, which represents a segment once it's written to disk, for instance at flush or merge time. We have a randomly generated ID in _SegmentInfo_ that can be used to verify whether two segments are the same. Since we use incremental numbers for segment naming, it's likely that two IndexWriters produce a segment with very similar contents and the same name. Yet, the _SegmentInfo_ ID would be different. In addition to this ID we also have checksums on files. These can be used to verify identity in addition to the ID, but should not be treated as identity by themselves since they are very weak checksums.

Now, segments also get _updated_, for instance when a document is marked as deleted or the segment receives a doc values update. The only thing that changes is the delete or update generation. This also allows two IndexWriters that opened two copies of a segment (with the same segment ID) to each produce a new delGen or dvGen that looks identical from the outside but is actually different. This is a problem that we see quite frequently in Elasticsearch, and we'd like to prevent it or have a better tool in our hands to distinguish _SegmentCommitInfo_ instances from one another. If we had an ID on _SegmentCommitInfo_ that changes each time one of these generations changes, we could much more easily tell if only the updated files (which are often very small) need to be replaced in order to recover an index.

The plan is to implement this in a very similar fashion as we did on _SegmentInfo_, but also to invalidate the ID once any of the generations changes, in order to force a new _SegmentCommitInfo_ ID for the new generation. Yet, the IDs would not be the same if two IndexWriters start from the same segment and make an identical change to it, i.e. it's not a replacement for a strong hash function.
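
The scenario described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the Lucene implementation: the class `CommitView`, its fields, and its methods are hypothetical names; the 16-byte random ID and the generation counter starting at 0 are choices made for this sketch. It shows two writers starting from copies of the same segment, each writing a new delete generation with the same number, and how a per-commit ID that is regenerated on every generation change tells the two results apart.

```java
import java.security.SecureRandom;
import java.util.Arrays;

/** Hypothetical stand-in for a per-commit view of a segment; not Lucene's actual class. */
final class CommitView {
    private static final SecureRandom RANDOM = new SecureRandom();

    final String name;      // incremental segment name, e.g. "_0"
    final byte[] segmentId; // assigned once, when the segment is first written
    final byte[] commitId;  // assumption: regenerated whenever a generation changes
    final long delGen;      // delete generation (starts at 0 in this sketch)
    final long dvGen;       // doc values update generation (starts at 0 in this sketch)

    private CommitView(String name, byte[] segmentId, byte[] commitId, long delGen, long dvGen) {
        this.name = name;
        this.segmentId = segmentId;
        this.commitId = commitId;
        this.delGen = delGen;
        this.dvGen = dvGen;
    }

    /** Random 16-byte ID; the length is an assumption of this sketch. */
    static byte[] randomId() {
        byte[] id = new byte[16];
        RANDOM.nextBytes(id);
        return id;
    }

    /** A freshly flushed or merged segment gets a new segment ID and a new commit ID. */
    static CommitView newSegment(String name) {
        return new CommitView(name, randomId(), randomId(), 0, 0);
    }

    /** Simulates copying the index to another node: IDs and generations are preserved. */
    CommitView copy() {
        return new CommitView(name, segmentId.clone(), commitId.clone(), delGen, dvGen);
    }

    /** A delete bumps delGen and invalidates the commit ID; the segment ID is kept. */
    CommitView withNextDelGen() {
        return new CommitView(name, segmentId, randomId(), delGen + 1, dvGen);
    }

    /** A doc values update bumps dvGen and invalidates the commit ID. */
    CommitView withNextDocValuesGen() {
        return new CommitView(name, segmentId, randomId(), delGen, dvGen + 1);
    }

    /** Same segment ID: the immutable segment files can be reused. */
    boolean sameSegmentFiles(CommitView other) {
        return Arrays.equals(segmentId, other.segmentId);
    }

    /** Same segment ID and same commit ID: the generational files can be reused too. */
    boolean sameGenerationalFiles(CommitView other) {
        return sameSegmentFiles(other) && Arrays.equals(commitId, other.commitId);
    }
}
```

With this model, two copies that each call `withNextDelGen()` end up with equal names and equal `delGen` values, `sameSegmentFiles` returns true, and `sameGenerationalFiles` returns false, so only the small generational files would need to be replaced during recovery. Because the IDs are random, two writers that make the very same change still get different commit IDs, which is the sense in which this is not a substitute for a strong hash.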
> Give IDs to SegmentCommitInfo
> -----------------------------
>
>                 Key: LUCENE-9324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9324
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Adrien Grand
>            Priority: Minor
>
> We already have IDs in SegmentInfo, which are useful to uniquely identify
> segments. Having IDs on SegmentCommitInfo would be useful too in order to
> compare commits for equality and make snapshots incremental on generational
> files too.