[
https://issues.apache.org/jira/browse/LUCENE-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150675#comment-13150675
]
Hoss Man commented on LUCENE-3577:
----------------------------------
bq. If there are just a few deletes in a few small segments, using optimize
instead of expungeDeletes is much more expensive?
that's what i was wondering ...
in most incrementally updated indexes i've seen for structured content (ie:
products, news, blogs, patents, etc...) the "recent" documents are the only
things likely to get updates (ie: a news story published in the past hour has a
decent chance of getting an update, a news story published yesterday might get
a typo fixed, but a news story published a year ago isn't likely to ever get
updated) so in a traditional merged segment structure the newer/smaller
segments are the only ones that tend to have deletes -- the bigger older
segments are mostly stagnant except when involved in merging. An expungeDeletes
call that only touches the small "recent" segments is going to be a lot faster
than a full optimize, correct?
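To put rough numbers on that intuition, here is a toy cost model, not Lucene
code. The Segment class, the segment names and the document counts are all
made up for illustration, and "cost" is simply the number of live documents
that would have to be rewritten. It assumes an expunge rewrites only segments
that contain deletes and an optimize rewrites every segment; a real merge
policy may apply thresholds that this sketch ignores.

```java
import java.util.Arrays;
import java.util.List;

public class MergeCostSketch {
    // Hypothetical stand-in for a segment: total docs and docs marked deleted.
    static final class Segment {
        final String name;
        final long docCount;
        final long deletedDocs;

        Segment(String name, long docCount, long deletedDocs) {
            this.name = name;
            this.docCount = docCount;
            this.deletedDocs = deletedDocs;
        }
    }

    // Live docs rewritten when only segments that contain deletes are merged.
    static long docsRewrittenByExpunge(List<Segment> segments) {
        long total = 0;
        for (Segment s : segments) {
            if (s.deletedDocs > 0) {
                total += s.docCount - s.deletedDocs;
            }
        }
        return total;
    }

    // Live docs rewritten when every segment is merged into one.
    static long docsRewrittenByOptimize(List<Segment> segments) {
        long total = 0;
        for (Segment s : segments) {
            total += s.docCount - s.deletedDocs;
        }
        return total;
    }

    // Made-up index: two big stagnant segments, two small recent ones with deletes.
    static List<Segment> exampleIndex() {
        return Arrays.asList(
                new Segment("old-1", 5_000_000, 0),
                new Segment("old-2", 1_000_000, 0),
                new Segment("recent-1", 20_000, 150),
                new Segment("recent-2", 2_000, 40));
    }

    public static void main(String[] args) {
        List<Segment> index = exampleIndex();
        System.out.println("expunge rewrites  " + docsRewrittenByExpunge(index) + " live docs");
        System.out.println("optimize rewrites " + docsRewrittenByOptimize(index) + " live docs");
    }
}
```

With these invented numbers the expunge touches about 22 thousand documents
while the optimize touches about 6 million, which is the gap being asked about.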
bq. Although, it doesn't really seem like an important use case (ensuring there
are no deletes).
I'm constantly surprised by the number of people who are really picky about
ensuring that their tf/idf numbers are *exact* because they use them in a weird
way -- it's definitely an expert level concern, but if those people are willing
to spend the time expunging deletes and we already have the code, might as well
leave it in right?
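For anyone wondering why deletes matter to those people: a small sketch of how
index statistics that still count deleted documents can shift an idf value.
This is standalone arithmetic, not a call into Lucene. It assumes the classic
idf shape 1 + ln(numDocs / (docFreq + 1)) and assumes that docFreq and the
document count keep including deleted documents until they are merged away.
The class name, method names and all the counts are invented for illustration.

```java
public class IdfSkewSketch {
    // Assumed classic idf shape: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // idf from statistics that still count deleted documents.
    static double staleIdf(long liveDocs, long liveDocFreq,
                           long deletedDocs, long deletedDocFreq) {
        return idf(liveDocFreq + deletedDocFreq, liveDocs + deletedDocs);
    }

    // idf from statistics over live documents only.
    static double exactIdf(long liveDocs, long liveDocFreq) {
        return idf(liveDocFreq, liveDocs);
    }

    public static void main(String[] args) {
        // Made-up counts: 950 live docs, 60 of them contain the term;
        // 50 deleted docs, 40 of which contained the term.
        long liveDocs = 950, liveDocFreq = 60;
        long deletedDocs = 50, deletedDocFreq = 40;

        System.out.println("idf with deletes still counted: "
                + staleIdf(liveDocs, liveDocFreq, deletedDocs, deletedDocFreq));
        System.out.println("idf over live docs only:        "
                + exactIdf(liveDocs, liveDocFreq));
    }
}
```

In this invented case the deleted documents were disproportionately likely to
contain the term, so the stale idf comes out lower than the exact one; once the
deletes are merged away the two values coincide.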
i think this is really just a question of naming/documentation: the method
doesn't sound as sexy as optimize, but if someone stumbles upon it and thinks
"oh wow, i guess i have to call this for my deletes to really be deleted"
that's bad. likewise the javadocs encourage/imply that this method *should*
be called, instead of just explaining that it *can* be called and what it does.
I don't have a good suggestion for the name, but the doc is really the issue...
{quote}
...When an index has many document deletions (or updates to existing
documents), it's best to either call optimize or expungeDeletes to remove all
unused data in the index associated with the deleted documents. To see how many
deletions you have pending in your index, call IndexReader.numDeletedDocs()
This saves disk space and memory usage while searching. ...
{quote}
...nothing in that description describes the downsides/cost of the method.
> rename expungeDeletes
> ---------------------
>
> Key: LUCENE-3577
> URL: https://issues.apache.org/jira/browse/LUCENE-3577
> Project: Lucene - Java
> Issue Type: Task
> Reporter: Robert Muir
>
> Similar to optimize(), expungeDeletes() has a misleading name.
> We already had problems with this on the user list because TieredMergePolicy
> didn't 'expunge' all their deletes.
> Also I think expunge is the wrong word, because expunge makes it seem
> like you just wrangle up the deletes and kick them out of the party and
> that it should be fast.