[ 
https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005735#comment-14005735
 ] 

Shai Erera commented on LUCENE-5693:
------------------------------------

So the question is: if we think the common case is to have close to 0 deleted 
documents in a newly flushed segment, how important is it to not write the 
postings of those documents? For example, I believe that many applications 
include stored fields in their Documents, and probably a good portion also 
include TermVectors. Since those two are big, and we're not removing deleted 
documents from them, what's the advantage of not writing the postings of 
deleted docs, at the cost of introducing potential bugs as Rob mentions?

And besides bugs, this does put the index into an inconsistent state. Yes, 
users should not rely on Lucene to be able to retrieve deleted documents, and 
we should have the freedom to optimize Lucene internals such that deleted docs 
are never flushed. I just wonder whether, in this case, handling only postings 
while leaving the majority of the index as it is today is worth the hassle and 
potential bugs.

I'm +0.5 on making this change for postings only.

If we can make this change global to all flushed content, including DocValues, 
I'd feel better about it. And I wonder whether we can. I understand this is 
not how the code works today, but if we computed the liveDocs before we flush 
any field, including DV, couldn't we change the code to not send deleted docs' 
fields to the DocValues API too? In that case we wouldn't need to remap 
ordinals, right? We'd still be flushing stored fields and TV of deleted docs, 
but perhaps that's acceptable?
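
To make the DocValues idea concrete, here is a minimal sketch in plain Java. 
It is not Lucene code: LiveDocsFlushSketch, Flushed and flush are hypothetical 
names, a BitSet stands in for liveDocs, and every buffered doc is assumed to 
have exactly one string value. It builds the term dictionary from live docs 
only, so the ordinals are dense from the start and nothing has to be remapped; 
deleted docs keep their docIDs but are written as missing (-1).

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch only; none of these names are Lucene APIs.
class LiveDocsFlushSketch {

    /** What would be handed to an (assumed) sorted doc-values writer. */
    static final class Flushed {
        final List<String> terms; // sorted unique values of live docs only
        final int[] ords;         // per-docID ordinal, or -1 for a deleted doc

        Flushed(List<String> terms, int[] ords) {
            this.terms = terms;
            this.ords = ords;
        }
    }

    /**
     * Assumes liveDocs was computed before any field is flushed and that
     * bufferedValues[doc] is non-null for every doc.
     */
    static Flushed flush(String[] bufferedValues, BitSet liveDocs) {
        // Build the term dictionary from live docs only, so values that
        // occur only in deleted docs never get an ordinal.
        TreeSet<String> unique = new TreeSet<>();
        for (int doc = 0; doc < bufferedValues.length; doc++) {
            if (liveDocs.get(doc)) {
                unique.add(bufferedValues[doc]);
            }
        }
        List<String> terms = new ArrayList<>(unique);

        // docIDs are preserved; deleted docs are simply marked missing.
        int[] ords = new int[bufferedValues.length];
        for (int doc = 0; doc < bufferedValues.length; doc++) {
            ords[doc] = liveDocs.get(doc)
                ? Collections.binarySearch(terms, bufferedValues[doc])
                : -1;
        }
        return new Flushed(terms, ords);
    }
}
```

With buffered values {"b", "a", "c", "a"} and doc 2 deleted, flush returns the 
terms [a, b] and the ords [1, 0, -1, 0]: "c" is never written, and the 
ordinals of the live docs need no remapping.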

> don't write deleted documents on flush
> --------------------------------------
>
>                 Key: LUCENE-5693
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5693
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-5693.patch
>
>
> When we flush a new segment, sometimes some documents are "born deleted", 
> e.g. if the app did an IW.deleteDocuments that matched some not-yet-flushed 
> documents.
> We already compute the liveDocs on flush, but then we continue (wastefully) 
> to send those known-deleted documents to all Codec parts.
> I started to implement this on LUCENE-5675 but it was too controversial.
> Also, I expect typically the number of deleted docs is 0, or small, so not 
> writing "born deleted" docs won't be much of a win for most apps.  Still it 
> seems silly to write them, consuming IO/CPU in the process, only to consume 
> more IO/CPU later for merging to re-delete them.



