[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush

Robert Muir (JIRA) Wed, 21 May 2014 17:23:44 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005450#comment-14005450 ]


Robert Muir commented on LUCENE-5693:
-------------------------------------

{quote}
I disagree: I think we discover places that are "relying" on deleted docs 
behavior, i.e. test bugs. When I did this on LUCENE-5675 there were only a few 
places that relied on deleted docs.
{quote}

That's not the complexity I'm concerned about. I'm talking about bugs in Lucene 
itself because shit like the following happens:
* various codec APIs unable to cope with writing 0-doc segments because all the 
docs were deleted
* various codec APIs with corner case bugs because stuff like 'maxdoc' in the 
SegmentInfo they are fed is inconsistent with what they saw.
* various index/search APIs unable to cope when docid X appears in codec API Y 
but not in codec API Z, where it's expected to exist.
* slow O(n) passes through IndexWriter APIs to recalculate and reshuffle 
ordinals and stuff like that.
* corner case bugs like incorrect statistics.
* additional complexity inside IndexWriter/codecs to handle this, when just 
merging away would be better.
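
To make the first few bullets concrete, here is a minimal sketch of what 
dropping deleted docs at flush implies for doc IDs. It is a toy model, not 
Lucene code: the class FlushRemapSketch and its methods are hypothetical names 
invented for illustration, and the only assumption is that live documents are 
tracked as bits in a java.util.BitSet.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Toy model (hypothetical, not Lucene code) of compacting doc IDs at flush. */
final class FlushRemapSketch {

    /** Maps each buffered doc ID to its ID after the drop, or -1 if it was deleted. */
    static int[] remap(BitSet liveDocs, int bufferedCount) {
        int[] oldToNew = new int[bufferedCount];
        int next = 0;
        for (int oldId = 0; oldId < bufferedCount; oldId++) {
            // Every deleted doc shifts all later live docs down by one.
            oldToNew[oldId] = liveDocs.get(oldId) ? next++ : -1;
        }
        return oldToNew;
    }

    /** Number of docs a component would see if deleted docs were skipped. */
    static int maxDocAfterDrop(BitSet liveDocs, int bufferedCount) {
        return liveDocs.get(0, bufferedCount).cardinality();
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0);
        live.set(2);
        live.set(3);
        // Five docs were buffered; docs 1 and 4 were deleted before flush.
        System.out.println(Arrays.toString(remap(live, 5)));
        System.out.println("buffered=5 written=" + maxDocAfterDrop(live, 5));
        // If every buffered doc was deleted, the flushed segment has 0 docs.
        System.out.println("written=" + maxDocAfterDrop(new BitSet(), 4));
    }
}
```

In this model, every component that writes per-document data has to agree on 
the same mapping and the same post-drop count. A component that still works 
from the buffered count, or from the old IDs, ends up inconsistent with the 
others.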

So if we want to rename the issue to "as a special case, don't write deleted 
postings on flush" and remove the TODO about changing this for things like DV, 
then I'm fine.

But otherwise, if this is intended to be a precedent of how things should work, 
then I strongly feel we should not do this. The additional complexity and 
corner cases are simply not worth it.

> don't write deleted documents on flush
> --------------------------------------
>
>                 Key: LUCENE-5693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5693
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-5693.patch
>
>
> When we flush a new segment, sometimes some documents are "born deleted", 
> e.g. if the app called IW.deleteDocuments and it matched some not-yet-flushed 
> documents.
> We already compute the liveDocs on flush, but then we continue (wastefully) 
> to send those known-deleted documents to all Codec parts.
> I started to implement this on LUCENE-5675 but it was too controversial.
> Also, I expect typically the number of deleted docs is 0, or small, so not 
> writing "born deleted" docs won't be much of a win for most apps.  Still it 
> seems silly to write them, consuming IO/CPU in the process, only to consume 
> more IO/CPU later for merging to re-delete them.
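
The "born deleted" case in the description above can be sketched with a toy 
in-memory buffer. This is an illustration under stated assumptions, not how 
IndexWriter is implemented: BornDeletedSketch and its methods are hypothetical, 
each document is reduced to a single term, and a delete only affects documents 
that were buffered before it.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model (hypothetical, not Lucene code) of docs deleted before their first flush. */
final class BornDeletedSketch {
    /** Buffered documents, each reduced to one term for simplicity. */
    private final List<String> buffered = new ArrayList<>();
    /** Bit i stays set while buffered document i is live. */
    private final BitSet liveDocs = new BitSet();

    void addDocument(String term) {
        liveDocs.set(buffered.size());
        buffered.add(term);
    }

    /** Marks every already-buffered document with this term as deleted. */
    void deleteDocuments(String term) {
        for (int i = 0; i < buffered.size(); i++) {
            if (buffered.get(i).equals(term)) {
                liveDocs.clear(i);
            }
        }
    }

    /** Docs sent to the writers if deleted docs are still written on flush. */
    int bufferedCount() {
        return buffered.size();
    }

    /** Docs sent to the writers if born-deleted docs were skipped instead. */
    int liveCount() {
        return liveDocs.cardinality();
    }

    public static void main(String[] args) {
        BornDeletedSketch buf = new BornDeletedSketch();
        buf.addDocument("a");
        buf.addDocument("b");
        buf.addDocument("a");
        buf.deleteDocuments("a"); // both buffered "a" docs are now born deleted
        buf.addDocument("a");     // added after the delete, so it stays live
        System.out.println("buffered=" + buf.bufferedCount() + " live=" + buf.liveCount());
    }
}
```

The gap between the two counts is the work the issue proposes to save. When 
nothing was deleted before flush, which the description expects to be the 
common case, the counts are equal and there is nothing to gain.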



