jainankitk commented on issue #13084:
URL: https://github.com/apache/lucene/issues/13084#issuecomment-3515182515

   > However, in practice, Lucene90LiveDocsFormat only has access to maxDoc and 
delCount (via SegmentCommitInfo) when deciding which implementation to use. We 
don't know the deletion distribution pattern until after loading.
   
   Oh yeah, of course. We will only know the memory footprint after building 
it; I overlooked that while thinking about it.
   
   > I've completed benchmarking across various deletion patterns (RANDOM, 
CLUSTERED, UNIFORM) and found that a 1% threshold (deletedDocs/maxDoc) works 
well for all cases with consistent iteration improvements (4x worst case) when 
using the sparse implementation and minimal memory overhead (+5% worst case) 
even for unlikely pathological inputs. 
   
   Thank you for the comprehensive benchmarking across different deletion 
patterns. I am not surprised that a low threshold like 1% works well. We can 
always start conservative with a low value and raise the threshold as we see 
value for different use cases.
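
To make the threshold idea concrete, here is a minimal sketch of the decision, assuming only `maxDoc` and `delCount` are available before loading. The class `LiveDocsChoice`, the `Representation` enum, and the `choose` method are hypothetical names for illustration and are not part of Lucene. Whether the comparison should be inclusive of the threshold is also an assumption.

```java
// Hypothetical helper, NOT part of Lucene: picks a live-docs representation
// using only the two values known before loading (maxDoc and delCount).
final class LiveDocsChoice {
    // Assumed default, taken from the 1% (deletedDocs/maxDoc) figure in the quote.
    static final double SPARSE_THRESHOLD = 0.01;

    enum Representation { DENSE, SPARSE }

    static Representation choose(int maxDoc, int delCount, double threshold) {
        if (maxDoc <= 0 || delCount < 0 || delCount > maxDoc) {
            throw new IllegalArgumentException(
                "invalid maxDoc=" + maxDoc + ", delCount=" + delCount);
        }
        // Fraction of documents in the segment that are deleted.
        double deletedRatio = (double) delCount / maxDoc;
        // Assumption: a ratio at or below the threshold selects the sparse form.
        return deletedRatio <= threshold ? Representation.SPARSE : Representation.DENSE;
    }

    public static void main(String[] args) {
        // 0.5% deleted: below the 1% threshold.
        System.out.println(choose(1_000_000, 5_000, SPARSE_THRESHOLD));  // prints SPARSE
        // 5% deleted: above the 1% threshold.
        System.out.println(choose(1_000_000, 50_000, SPARSE_THRESHOLD)); // prints DENSE
    }
}
```

Keeping the threshold as a parameter makes it easy to start with a low value and raise it later if benchmarks justify it.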
   
   > I have a draft https://github.com/apache/lucene/pull/15413...I might need 
to do more changes...especially if we need to make this easier to use for 
consumers like PointTreeBucketCollector.
   
   Woah, that was quick! I will try to review the changes soon. It will be 
great to consume this as part of `PointTreeBucketCollector` and handle leaves 
with a low number of deleted documents.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

