[ 
https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529960#comment-16529960
 ] 

Adrien Grand commented on LUCENE-8263:
--------------------------------------

I just had a quick look at this. Given that the score formula (the lower the 
better) for a merge is
{noformat}
({largest segment to merge} / {total size of the merge})
  * {total size of the merge} ^ 0.05
  * {live docs ratio} ^ reclaimDeletesWeight
{noformat}
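As a rough illustration of the formula above, here is a minimal Java sketch. It is not the actual TieredMergePolicy code: the class name, method name and parameters are hypothetical, and the 5 GB maximum segment size and the weight of 2.0 are assumed values chosen only for the example.

```java
final class MergeScoreSketch {

    /**
     * Score as written in the formula above; lower is better.
     * All names here are illustrative and not part of the Lucene API.
     */
    static double score(double largestSegmentBytes,
                        double totalMergeBytes,
                        double liveDocsRatio,
                        double reclaimDeletesWeight) {
        // {largest segment to merge} / {total size of the merge}
        double skew = largestSegmentBytes / totalMergeBytes;
        return skew
            * Math.pow(totalMergeBytes, 0.05)
            * Math.pow(liveDocsRatio, reclaimDeletesWeight);
    }

    public static void main(String[] args) {
        double maxSegmentBytes = 5.0 * 1024 * 1024 * 1024; // assumed 5 GB
        int maxMergeAtOnce = 10;
        double weight = 2.0; // assumed reclaimDeletesWeight

        // Ten equally sized segments with no deletes: skew is 1/maxMergeAtOnce.
        double perfect = score(maxSegmentBytes / maxMergeAtOnce, maxSegmentBytes, 1.0, weight);
        // Same shape of merge, but only 70% of the documents are live.
        double withDeletes = score(maxSegmentBytes / maxMergeAtOnce, maxSegmentBytes, 0.7, weight);

        System.out.println("perfect merge score:      " + perfect);
        System.out.println("merge with deletes score: " + withDeletes);
    }
}
```

With everything else equal, a lower live docs ratio lowers the score, so merges that reclaim more deletes are favored.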
A perfect merge with no deletes gets a score of
{noformat}
1/maxMergeAtOnce * maxSegmentSize^0.05 * 1
{noformat}
and a merge to reclaim deletes on a segment of the maximum size that has many 
deletes would get the score below, assuming we find small segments to merge 
with it so that the merge size is {{maxSegmentSize}}:
{noformat}
{live docs ratio} * maxSegmentSize^0.05 * {live docs ratio} ^ reclaimDeletesWeight
{noformat}
So if I'm not mistaken, we could just use {{reclaimDeletesWeight}} and add 
too-large segments to the list of candidates, and singleton merges would be 
possible when {{reclaimDeletesWeight}} is greater than
{noformat}
-log(maxMergeAtOnce) / log({live docs ratio}) - 1
{noformat}
Assuming the default max merge at once of 10, Lucene would allow for singleton 
merges with 20% deletes when {{reclaimDeletesWeight >= 9.3}} and singleton 
merges with 50% deletes when {{reclaimDeletesWeight >= 2.3}}.
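The threshold follows from comparing the two scores: the singleton merge beats a perfect merge when {live docs ratio}^(1 + reclaimDeletesWeight) < 1/maxMergeAtOnce, which rearranges to the bound above because log({live docs ratio}) is negative. A small Java sketch to check the arithmetic (the class and method names are hypothetical, not Lucene API):

```java
import java.util.Locale;

final class SingletonMergeThreshold {

    /**
     * Smallest reclaimDeletesWeight above which a singleton merge of a
     * max-size segment scores better than a perfect merge, per the bound
     * -log(maxMergeAtOnce) / log(liveDocsRatio) - 1.
     * Illustrative helper only; not part of Lucene.
     */
    static double minReclaimDeletesWeight(int maxMergeAtOnce, double liveDocsRatio) {
        if (maxMergeAtOnce < 2) {
            throw new IllegalArgumentException("maxMergeAtOnce must be >= 2");
        }
        if (!(liveDocsRatio > 0.0 && liveDocsRatio < 1.0)) {
            throw new IllegalArgumentException("liveDocsRatio must be in (0, 1)");
        }
        return -Math.log(maxMergeAtOnce) / Math.log(liveDocsRatio) - 1.0;
    }

    public static void main(String[] args) {
        int maxMergeAtOnce = 10; // the default mentioned above
        // 20% deletes means a live docs ratio of 0.8
        System.out.println(String.format(Locale.ROOT, "20%% deletes: %.1f",
            minReclaimDeletesWeight(maxMergeAtOnce, 0.8))); // prints "20% deletes: 9.3"
        // 50% deletes means a live docs ratio of 0.5
        System.out.println(String.format(Locale.ROOT, "50%% deletes: %.1f",
            minReclaimDeletesWeight(maxMergeAtOnce, 0.5))); // prints "50% deletes: 2.3"
    }
}
```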

I think the API would be nicer if we exposed this new target percentage of 
deleted docs that Erick is proposing, and then computed the weight for scoring 
deletes internally accordingly, so that we could remove this opaque 
{{reclaimDeletesWeight}} from the API?

bq. what should the floor be? I propose 10% with strong documentation warnings 
about not setting it below 20%.

My gut feeling is that our users focus on disk usage from deleted documents 
because this is something that we make very visible. I have seen some other 
systems reserve fixed amounts of space per value (thus waste) and users were 
just happy about it because they only looked at the average number of bytes per 
document which was fine and they didn't know about this waste. I'm tempted to 
be more conservative than that and rather use 20% as a floor.


> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more 
> aggressive merging
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8263
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8263
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large 
> indexes. This parameter will do more aggressive merging of segments with 
> deleted documents when the _total_ percentage of deleted docs in the entire 
> index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20% 
> caused the first cut at this to increase I/O roughly 10%. Setting it to 10% 
> caused about a 50% increase in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits 
> that reference this new parameter. After it's checked in we can bring this 
> back. That should be less work than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly 
> less space wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation 
> warnings about not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in 
> 7976. The first cut at this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the 
> segment was over maxSegmentSize/2 bytes in order to be eligible for merging. 
> Empirically, using the same percentage for both caused the merging to hover 
> around the value specified for this parameter.
> My proposal for <3> would be to have the parameter do double-duty. Assuming 
> my preliminary results hold, you specify this parameter at, say, 20% and once 
> the index hits that % deleted docs it hovers right around there, even if 
> you've forceMerged earlier down to 1 segment. This seems in line with what 
> I'd expect and adding another parameter seems excessively complicated to no 
> good purpose. We could always add something like that later if we wanted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
