My question is in the context of Solr, but I think it would probably be
best implemented in Lucene, for the benefit of all Lucene-based
software.  I'm describing it here to decide whether I should raise an issue.

I'm after something that would simply rewrite any segment containing
deleted documents, without actually merging the segments.  It would be
*like* a merge, except that it would usually merge one segment to one
segment, instead of many to one.

If the deleted documents are evenly scattered across the whole index
(shard), simply doing forceMerge might be just as efficient, assuming
disk space is not a concern.  A use case with highly-bunched deletes and
a relatively large number of segments would only need to work on some of
the segments, and would complete faster.  I suspect that bunched deletes
are probably common in actual user indexes, at least for the ones where
most deletes are related to document updates.
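
As a rough illustration of why bunched deletes matter, the following Java sketch compares the bytes a one-to-one rewrite would touch against the bytes a full forceMerge would touch. The SegmentStats and RewriteCostSketch classes are hypothetical stand-ins invented for this example. They are not Lucene classes, and a real implementation would read the sizes and delete counts from the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-segment statistics; not part of Lucene.
final class SegmentStats {
    final String name;
    final long sizeBytes;
    final int delCount;

    SegmentStats(String name, long sizeBytes, int delCount) {
        this.name = name;
        this.sizeBytes = sizeBytes;
        this.delCount = delCount;
    }
}

final class RewriteCostSketch {
    // Segments that a one-to-one rewrite would have to touch:
    // only those that contain at least one deleted document.
    static List<SegmentStats> segmentsWithDeletes(List<SegmentStats> segments) {
        List<SegmentStats> out = new ArrayList<>();
        for (SegmentStats s : segments) {
            if (s.delCount > 0) {
                out.add(s);
            }
        }
        return out;
    }

    // Total on-disk size of the given segments.
    static long totalBytes(List<SegmentStats> segments) {
        long sum = 0;
        for (SegmentStats s : segments) {
            sum += s.sizeBytes;
        }
        return sum;
    }
}
```

With one large clean segment and a few small segments holding all the deletes, totalBytes(segmentsWithDeletes(all)) is a small fraction of totalBytes(all), which is the case where the proposed operation should win over forceMerge.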

I don't know what this operation would be called.  I can start the
bikeshedding with something like wipeDeletes.  Using expungeDeletes
would be awesome, but this name is already used as a parameter for
another operation, at least in Solr.

I can imagine two methods, one which has no arguments and one that takes
two float percentage thresholds.

For the second method, the two thresholds would control what happens
based on how much of the index is taken up by segments with deletes.
The first threshold, which might be called "mergeThreshold", would merge
the segments with deletes into a single segment IF the space used by
those segments is less than or equal to that percentage of the whole
index.  The second threshold, which might be called
"forceMergeThreshold", would change the request into a forceMerge if the
space used by the segments with deletes is greater than or equal to
that percentage of the whole index.

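The no-arg method could go two ways:  Either it *only* rewrites segments
one to one (maybe calling the other method with arguments that disable
both thresholds, such as a negative mergeThreshold and a
forceMergeThreshold above 100; Float.MIN_VALUE would not do this,
because it is the smallest positive float rather than a negative
number), or it assigns reasonable default values to the two thresholds,
perhaps 30 and 90 percent.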
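
The threshold logic could be sketched in Java roughly as follows. This is only an illustration of the idea, not an existing Lucene or Solr API: the WipeDeletesPlanner class, the Action enum, and the plan method are hypothetical names. The sketch also makes assumptions the description leaves open: forceMergeThreshold is checked first, anything between the two thresholds falls back to a one-to-one rewrite, and an index with no deletes needs no work.

```java
// Hypothetical planner; none of these names exist in Lucene or Solr.
final class WipeDeletesPlanner {
    enum Action { NOTHING_TO_DO, REWRITE_ONE_TO_ONE, MERGE_DELETED_SEGMENTS, FORCE_MERGE }

    // Default percentages suggested for the no-arg variant.
    static final float DEFAULT_MERGE_THRESHOLD = 30f;
    static final float DEFAULT_FORCE_MERGE_THRESHOLD = 90f;

    // No-threshold variant: uses the default percentages.
    static Action plan(long bytesWithDeletes, long totalIndexBytes) {
        return plan(bytesWithDeletes, totalIndexBytes,
                DEFAULT_MERGE_THRESHOLD, DEFAULT_FORCE_MERGE_THRESHOLD);
    }

    // bytesWithDeletes: space used by segments that contain deletes.
    // Thresholds are percentages of the whole index size.
    static Action plan(long bytesWithDeletes, long totalIndexBytes,
                       float mergeThreshold, float forceMergeThreshold) {
        if (totalIndexBytes <= 0) {
            throw new IllegalArgumentException("totalIndexBytes must be positive");
        }
        if (bytesWithDeletes <= 0) {
            return Action.NOTHING_TO_DO; // assumption: no deletes, no work
        }
        double pct = 100.0 * bytesWithDeletes / totalIndexBytes;
        if (pct >= forceMergeThreshold) {
            return Action.FORCE_MERGE; // assumption: this check takes precedence
        }
        if (pct <= mergeThreshold) {
            return Action.MERGE_DELETED_SEGMENTS;
        }
        return Action.REWRITE_ONE_TO_ONE; // assumption: the in-between case
    }
}
```

Under these assumptions, a caller who wants only one-to-one rewrites could pass a negative mergeThreshold and a forceMergeThreshold above 100, so that neither branch can trigger.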

On my dev server, optimizing a 33GB index shard takes over 3500 seconds
-- close to an hour.  I only do the optimize (forceMerge in Lucene) to
clean out deletes so they don't accumulate.  Any performance increase
that I obtain is a nice bonus -- not the reason for the optimize.

I would expect the operation I am describing here to take a fraction of
that time, if it is run on an index that has never been optimized.  My
TieredMergePolicy (TMP) settings are roughly equivalent to a mergeFactor
of 35, so I have the potential for many segments:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">35</int>
  <int name="segmentsPerTier">35</int>
  <int name="maxMergeAtOnceExplicit">105</int>
</mergePolicy>

Most of my deletes are concentrated in the most recently added
documents.  Normal merging will eliminate some of them, and most of what
is left will be in the first tier of merged segments, which should be
pretty small.  Getting rid of deleted documents should be very efficient
on my indexes with this operation.

Thanks,
Shawn

