RS146BIJAY opened a new issue, #15352:
URL: https://github.com/apache/lucene/issues/15352

   ### Description
   
   ## Background
   
   We are working on a feature in OpenSearch, context aware segments, that 
maintains multiple `IndexWriter` instances within a shard, one per group, to 
colocate related data in the same segment or group of segments. The design is 
detailed in the following RFCs and LLD:
   
   * [OpenSearch 
RFC](https://github.com/opensearch-project/OpenSearch/issues/18576)
   * [Lucene RFC](https://github.com/apache/lucene/issues/13387)
   * [OpenSearch 
LLD](https://github.com/opensearch-project/OpenSearch/issues/19530)
   
   ## Current Use Case
   
   With context aware segments, writes within a shard are routed to the 
respective group-specific `IndexWriter` instances. To keep versioning 
consistent across writers during update operations, we perform a **hard 
delete** of the previous document version in the parent (accumulating) 
`IndexWriter` whenever a new version is added to a group-specific writer.
   
   ## Problem Description
   
   Currently, with only soft deletes enabled, OpenSearch [uses 
`SegmentReader.hardLiveDocs` to query live 
docs](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/common/lucene/Lucene.java#L942)
 during DocRep recovery from segments with hard deletes (which may have been 
introduced by `IndexWriter` hitting non-aborting exceptions). The number of 
live docs is derived efficiently as:
   
   `segmentReader.maxDoc() - segmentReader.getSegmentInfo().getDelCount()`
   
   However, when both soft and hard deletes are performed on a Lucene index 
with context aware segments enabled, the above calculation breaks down because 
`segmentReader.getSegmentInfo().getDelCount()` no longer provides an accurate 
delete count for a segment. Based on [Lucene's unit tests for mixed 
deletes](https://github.com/apache/lucene/blob/f2da05b25396a72adb07895c8858a15841c3c6a9/lucene/core/src/test/org/apache/lucene/index/TestSoftDeletesRetentionMergePolicy.java#L696),
 the only reliable way to get the live doc count is to iterate through 
`hardLiveDocs` and count the set bits.
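   
   To make the two approaches concrete, here is a minimal, self-contained 
sketch. It does not use Lucene itself: the nested `Bits` interface is a 
stand-in that mirrors only the `get(int)` and `length()` methods of 
`org.apache.lucene.util.Bits`, and the class name, helper names, and sample 
values are hypothetical, chosen purely for illustration.

```java
import java.util.BitSet;

public class LiveDocCountSketch {

    /** Stand-in for org.apache.lucene.util.Bits (assumption: only get/length are needed). */
    public interface Bits {
        boolean get(int index);
        int length();
    }

    /** Iterative count: visits every doc ID, so the cost grows linearly with maxDoc. */
    public static int countLiveDocs(Bits hardLiveDocs) {
        int live = 0;
        for (int docId = 0; docId < hardLiveDocs.length(); docId++) {
            if (hardLiveDocs.get(docId)) {
                live++;
            }
        }
        return live;
    }

    /** Constant-time shortcut: only correct when delCount matches the cleared bits. */
    public static int liveDocsFromDelCount(int maxDoc, int delCount) {
        return maxDoc - delCount;
    }

    /** Hypothetical helper that exposes a java.util.BitSet through the stand-in interface. */
    public static Bits fromBitSet(BitSet bits, int maxDoc) {
        return new Bits() {
            @Override
            public boolean get(int index) {
                return bits.get(index);
            }

            @Override
            public int length() {
                return maxDoc;
            }
        };
    }

    public static void main(String[] args) {
        int maxDoc = 8;
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc); // every document starts out live
        live.clear(2);       // first hard delete
        live.clear(5);       // second hard delete

        Bits hardLiveDocs = fromBitSet(live, maxDoc);
        System.out.println(countLiveDocs(hardLiveDocs));      // prints 6
        System.out.println(liveDocsFromDelCount(maxDoc, 2));  // prints 6
    }
}
```

   In this toy case both numbers agree because the delete count passed to the 
shortcut matches the cleared bits exactly. The concern in this issue is the 
case where that count can no longer be trusted, which leaves only the linear 
scan.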
   
   ## Performance Impact
   
   This iterative counting operation is computationally expensive for large 
segments and can potentially cause significant performance regressions during 
shard recovery.
   
   ## Ask from this issue
   
   Is there a more optimized, direct way to retrieve the count of live 
documents from a SegmentReader's hardLiveDocs when a segment has undergone both 
hard and soft deletes?
   



