jpountz commented on code in PR #13430:
URL: https://github.com/apache/lucene/pull/13430#discussion_r1625497128
##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -522,21 +550,28 @@ private MergeSpecification doFindMerges(
final List<SegmentCommitInfo> candidate = new ArrayList<>();
Review Comment:
Merging is only necessary if there are more segments than necessary or more
deletes than necessary. So this condition just stops looking for more merges
when none of these two conditions are met?
##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -522,21 +550,28 @@ private MergeSpecification doFindMerges(
final List<SegmentCommitInfo> candidate = new ArrayList<>();
boolean hitTooLarge = false;
long bytesThisMerge = 0;
+ long docCountThisMerge = 0;
for (int idx = startIdx;
idx < sortedEligible.size()
&& candidate.size() < mergeFactor
- && bytesThisMerge < maxMergedSegmentBytes;
+ && bytesThisMerge < maxMergedSegmentBytes
+ && docCountThisMerge < allowedDocCount;
idx++) {
final SegmentSizeAndDocs segSizeDocs = sortedEligible.get(idx);
final long segBytes = segSizeDocs.sizeInBytes;
-
+ int segDocCount = segSizeDocs.maxDoc - segSizeDocs.delCount;
+ if (docCountThisMerge + segDocCount > allowedDocCount) {
+ // We don't want to merge segments that will produce more
documents than allowedDocCount
+ continue;
Review Comment:
I think we should `break` here, not `continue`. `continue` allows producing
merges of segments whose sizes are not adjacent, I don't think we should allow
this for the doc count condition, as this potentially makes merging run in
quadratic time.
##########
lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java:
##########
@@ -77,20 +77,23 @@ protected void assertSegmentInfos(MergePolicy policy,
SegmentInfos infos) throws
long levelSizeBytes = Math.max(minSegmentBytes, (long)
(tmp.getFloorSegmentMB() * 1024 * 1024));
long bytesLeft = totalBytes;
double allowedSegCount = 0;
+ final int maxNumSegmentsOnHighestTier =
+ (int) Math.max(tmp.getSegmentsPerTier(),
tmp.getTargetSearchConcurrency());
// below we make the assumption that segments that reached the max segment
// size divided by 2 don't need merging anymore
int mergeFactor = (int) Math.min(tmp.getSegmentsPerTier(),
tmp.getMaxMergeAtOnce());
while (true) {
final double segCountLevel = bytesLeft / (double) levelSizeBytes;
- if (segCountLevel < tmp.getSegmentsPerTier() || levelSizeBytes >=
maxMergedSegmentBytes / 2) {
+ if (segCountLevel < maxNumSegmentsOnHighestTier
Review Comment:
Should it allow `maxNumSegmentsOnHighestTier` for consistency with the code
under `main`? Though I don't expect it to make a big difference since
`segCountLevel` is a double.
```suggestion
if (segCountLevel <= maxNumSegmentsOnHighestTier
```
##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -522,21 +550,28 @@ private MergeSpecification doFindMerges(
final List<SegmentCommitInfo> candidate = new ArrayList<>();
boolean hitTooLarge = false;
long bytesThisMerge = 0;
+ long docCountThisMerge = 0;
for (int idx = startIdx;
idx < sortedEligible.size()
&& candidate.size() < mergeFactor
- && bytesThisMerge < maxMergedSegmentBytes;
+ && bytesThisMerge < maxMergedSegmentBytes
+ && docCountThisMerge < allowedDocCount;
idx++) {
final SegmentSizeAndDocs segSizeDocs = sortedEligible.get(idx);
final long segBytes = segSizeDocs.sizeInBytes;
-
+ int segDocCount = segSizeDocs.maxDoc - segSizeDocs.delCount;
+ if (docCountThisMerge + segDocCount > allowedDocCount) {
+ // We don't want to merge segments that will produce more
documents than allowedDocCount
+ continue;
Review Comment:
We probably also should produce a singleton merge here in case
`candidate.isEmpty()` (see what we're doing for too large segments below). Most
likely this suggested merge will not be selected because it will have a low
score, though it could end up selected if it reclaims lots of deletes and we
don't have better merges.
##########
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##########
@@ -916,14 +951,13 @@ public MergeSpecification
findForcedDeletesMerges(SegmentInfos infos, MergeConte
final Set<SegmentCommitInfo> merging = mergeContext.getMergingSegments();
boolean haveWork = false;
+ int totalDelCount = 0;
for (SegmentCommitInfo info : infos) {
int delCount = mergeContext.numDeletesToMerge(info);
assert assertDelCount(delCount, info);
+ totalDelCount += delCount;
double pctDeletes = 100. * ((double) delCount) / info.info.maxDoc();
- if (pctDeletes > forceMergeDeletesPctAllowed && !merging.contains(info))
{
- haveWork = true;
- break;
- }
+ haveWork = haveWork || (pctDeletes > forceMergeDeletesPctAllowed &&
!merging.contains(info));
Review Comment:
Why don't we exit the loop anymore when `haveWork` is true?
##########
lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java:
##########
@@ -111,11 +114,29 @@ protected void assertSegmentInfos(MergePolicy policy,
SegmentInfos infos) throws
}
}
+ // There can be more segments if we can't merge docs - they are balanced
between segments
+ int maxDocsPerSegment = tmp.getMaxAllowedDocs(infos.totalMaxDoc());
+ List<Integer> segmentDocs =
+ infos.asList().stream()
+ .map(info -> info.info.maxDoc() - info.getDelCount())
+ .sorted()
+ .toList();
+ int numEligible = 0;
+ int currentSum = 0;
+ for (int i = 0; i < segmentDocs.size(); i++) {
+ currentSum += segmentDocs.get(i);
+ if (currentSum > maxDocsPerSegment) {
+ break;
+ }
+ numEligible++;
+ }
+ boolean eligibleDocsMerge = numEligible > 1;
Review Comment:
I wonder if the condition can be made simpler. Would it be ok to check the
sum of the doc counts of the two smallest segments? If it's greater than the
allowed doc count, then no merging is allowed, otherwise this 2-segments merge
would have been a legal merge?
##########
lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java:
##########
@@ -111,11 +114,29 @@ protected void assertSegmentInfos(MergePolicy policy,
SegmentInfos infos) throws
}
}
+ // There can be more segments if we can't merge docs - they are balanced
between segments
+ int maxDocsPerSegment = tmp.getMaxAllowedDocs(infos.totalMaxDoc());
Review Comment:
should we subtract deletions to totalMaxDoc?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]