[ https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543616#comment-16543616 ]
Marc Morissette commented on LUCENE-8263:
-----------------------------------------

I would like to argue against a 20% floor.

Some indexes contain documents of wildly different sizes, with the larger documents experiencing much higher turnover. I have seen indexes with around 20% deletions that were more than 2x their optimized size because of this phenomenon. In such situations, a deletesPctAllowed of around 10-15% would make a lot of sense.

I say keep the floor at 10%.

> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more
> aggressive merging
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8263
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8263
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: LUCENE-8263.patch
>
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large
> indexes. This parameter will do more aggressive merging of segments with
> deleted documents when the _total_ percentage of deleted docs in the entire
> index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20%
> caused the first cut at this to increase I/O roughly 10%. Setting it to 10%
> caused about a 50% increase in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits
> that reference this new parameter. After it's checked in we can bring this
> back. That should be less work than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly
> less space wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation
> warnings about not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in
> 7976. The first cut at this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the
> segment was over maxSegmentSize/2 bytes in order to be eligible for merging.
> Empirically, using the same percentage for both caused the merging to hover
> around the value specified for this parameter.
> My proposal for <3> would be to have the parameter do double-duty. Assuming
> my preliminary results hold, you specify this parameter at, say, 20% and once
> the index hits that % deleted docs it hovers right around there, even if
> you've forceMerged earlier down to 1 segment. This seems in line with what
> I'd expect and adding another parameter seems excessively complicated to no
> good purpose. We could always add something like that later if we wanted.
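
For readers following the floor discussion, here is a rough back-of-the-envelope sketch of the point made in the comment at the top of this message: a percentage measured by document count can understate wasted space when the deleted documents are the large ones. Everything in it is an assumption for illustration only: the `DocGroup` type, the helper names, the document counts and the sizes are hypothetical, index size is treated as simply proportional to the sum of document sizes, and "optimized size" is taken to mean the size of the live documents alone. It is not how Lucene measures anything internally.

```python
from dataclasses import dataclass


@dataclass
class DocGroup:
    """A hypothetical bucket of same-sized documents (illustration only)."""
    count: int      # number of documents in the bucket
    size: float     # size of each document, in arbitrary units
    deleted: bool   # True if marked deleted but not yet merged away


def deleted_pct(groups):
    """Percentage of documents that are deleted, measured by document count."""
    total = sum(g.count for g in groups)
    deleted = sum(g.count for g in groups if g.deleted)
    return 100.0 * deleted / total


def bloat_ratio(groups):
    """Size of all documents divided by the size of the live documents only."""
    on_disk = sum(g.count * g.size for g in groups)
    live = sum(g.count * g.size for g in groups if not g.deleted)
    return on_disk / live


# Assumed workload: small documents rarely change, large documents turn over often.
index = [
    DocGroup(count=700, size=1.0, deleted=False),   # live small docs
    DocGroup(count=100, size=10.0, deleted=False),  # live large docs
    DocGroup(count=200, size=10.0, deleted=True),   # deleted large docs
]

print(f"deleted by count: {deleted_pct(index):.1f}%")                # 20.0%
print(f"size relative to live docs only: {bloat_ratio(index):.2f}x")  # 2.18x
```

With these made-up numbers, 200 of 1000 documents are deleted (20% by count), yet the deleted documents account for 2000 of 3700 size units, so the index is about 2.18 times the size of its live data. Under that assumption a threshold in the 10-15% range would start reclaiming space noticeably earlier than one fixed at 20%.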