[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005735#comment-14005735 ]

Shai Erera commented on LUCENE-5693:

So the question is: if we think the common case is to have close to 0 deleted documents in a newly flushed segment, how important is it to not write the postings of those documents? For example, I believe that many applications include stored fields in their Documents, and probably a good portion also TermVectors. Since those two are big, but we're not removing deleted documents from them, what's the advantage of not writing the postings of deleted docs, at the cost of introducing potential bugs as Rob mentions?

And besides bugs, this does put the index into an inconsistent state -- yes, users should not rely on Lucene to be able to retrieve deleted documents, and we should have the freedom to optimize Lucene internals such that deleted docs are never flushed, but I just wonder if in this case, handling only postings while leaving the majority of the index as-is today is worth the hassle and potential bugs.

I'm +0.5 to make this change to only postings. If we can make this change global to all flushed content, including DocValues, I'd feel better about it. And I wonder if we can't ... so I understand this is not how the code works today, but if we computed the liveDocs before we flush any field, including DV, couldn't we change the code to not send deleted docs' fields to the DocValues API too? In that case we won't need to remap ordinals, right? We'd still be flushing stored fields and TV of deleted docs, but perhaps that's acceptable?

don't write deleted documents on flush
--
Key: LUCENE-5693
URL: https://issues.apache.org/jira/browse/LUCENE-5693
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-5693.patch

When we flush a new segment, sometimes some documents are born deleted, e.g. if the app did an IW.deleteDocuments that matched some not-yet-flushed documents. We already compute the liveDocs on flush, but then we continue (wastefully) to send those known-deleted documents to all Codec parts.

I started to implement this on LUCENE-5675 but it was too controversial. Also, I expect typically the number of deleted docs is 0, or small, so not writing born deleted docs won't be much of a win for most apps. Still it seems silly to write them, consuming IO/CPU in the process, only to consume more IO/CPU later for merging to re-delete them.

--
This message was sent by Atlassian JIRA (v6.2#6252)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005736#comment-14005736 ]

Michael McCandless commented on LUCENE-5693:

bq. So if we want to rename the issue to, as a special case, "don't write deleted postings on flush" and remove the TODO about changing this for things like DV, then I'm fine.

+1, I'll rename the issue, remove the TODO about not sending deleted docs to DVs on flush (make it clear this is only about not writing deleted docs to postings), and add a separate test case for the ToParentBJQ.explain bug.
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005740#comment-14005740 ]

Michael McCandless commented on LUCENE-5693:

bq. couldn't we change the code to not send deleted docs' fields to the DocValues API too?

We could, but Rob is strongly against that, so I'll remove that TODO and make it clear this is just about not wasting IO/CPU writing deleted postings.

Also, remember that postings are the most costly part of the merge, so not writing deleted docs there gets us the most gains.
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005751#comment-14005751 ]

ASF subversion and git services commented on LUCENE-5693:

Commit 1596783 from [~mikemccand] in branch 'dev/branches/lucene5675'
[ https://svn.apache.org/r1596783 ]

LUCENE-5693, LUCENE-5675: also decouple this bug fix (move to LUCENE-5693) in ToParentBJQ.explain
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005754#comment-14005754 ]

Shai Erera commented on LUCENE-5693:

I don't see what's wrong w/ not sending the values of deleted docs to the DV API. As far as they're concerned, they get {{null}} values for Sorted/SortedSet/Binary/Numeric, which they should also be prepared for, as not all docs will have values even without deletes. And so there won't be any ord-remapping? Like if all docs associated w/ value foo are deleted, foo won't be sent to the Codec in the first place, and therefore will never receive an ord?

I wonder how complicated it is to patch it up, so we can look at it. And perhaps we'd even be able to tell if there's any code that breaks.
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005787#comment-14005787 ]

Simon Willnauer commented on LUCENE-5693:

I really wonder if this issue matters. The use case for this is when you update a document while it's still in RAM, and to me this really seems like a corner case. I think we should just stick with the corner case and not complicate the code if possible? If it makes things cleaner I am ok with it, but please don't complicate things for this corner case...
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005795#comment-14005795 ]

Shai Erera commented on LUCENE-5693:

Hmm, I reviewed SortedDVWriter and I understand the ord remapping problem that Rob was talking about. I was confused and thought that the Codec is the one that assigns the ords, and so if it receives a {{null}} value, it would be fine. But it's the SortedDVWriter which assigns the ords, and since it already assigned an ord to a value of a document that was later deleted, it would need to remap the ords around it.

So maybe we do only need to focus on postings in this issue. I don't think that we need to remove the TODO though ... we have plenty of TODOs in the code, and it's something valid to consider one day; let's just keep this issue focused for now.
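To make the ord remapping problem concrete, here is a minimal sketch in plain Java. It is not Lucene's SortedDocValuesWriter: the class name OrdRemapSketch, the one-value-per-document model, and the use of strings instead of BytesRef are all assumptions made for illustration. It only shows why dropping values of deleted docs at flush forces an extra pass over the buffered docs to build an old-ord to new-ord table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a writer that assigns ords as values arrive.
final class OrdRemapSketch {
  private final Map<String, Integer> ordByValue = new HashMap<>();
  private final List<String> valueByOrd = new ArrayList<>();
  private final List<Integer> ordByDoc = new ArrayList<>(); // index = docID

  // Buffers one value for the next docID; the ord is fixed at this point,
  // before we know whether the document will later be deleted.
  void addValue(String value) {
    Integer ord = ordByValue.get(value);
    if (ord == null) {
      ord = valueByOrd.size();
      ordByValue.put(value, ord);
      valueByOrd.add(value);
    }
    ordByDoc.add(ord);
  }

  // Builds the old-ord -> new-ord table needed if values referenced only by
  // deleted docs were dropped at flush. -1 marks a dropped ord.
  int[] remap(BitSet liveDocs) {
    boolean[] used = new boolean[valueByOrd.size()];
    for (int doc = 0; doc < ordByDoc.size(); doc++) { // extra O(numDocs) pass
      if (liveDocs.get(doc)) {
        used[ordByDoc.get(doc)] = true;
      }
    }
    int[] oldToNew = new int[used.length];
    int next = 0;
    for (int ord = 0; ord < used.length; ord++) {
      oldToNew[ord] = used[ord] ? next++ : -1;
    }
    return oldToNew;
  }

  public static void main(String[] args) {
    OrdRemapSketch writer = new OrdRemapSketch();
    for (String v : new String[] {"foo", "bar", "foo", "baz"}) {
      writer.addValue(v);
    }
    BitSet liveDocs = new BitSet(4);
    liveDocs.set(0, 4);
    liveDocs.clear(1); // the only doc holding "bar" is born deleted
    System.out.println(Arrays.toString(writer.remap(liveDocs)));
  }
}
```

Every per-document ord would then have to be rewritten through that table before being handed to the codec, which is the additional flush-time work being weighed against simply letting a later merge drop the value.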
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005828#comment-14005828 ]

Michael McCandless commented on LUCENE-5693:

bq. I really wonder if this issue matters.

I suspect it's uncommon in most cases that docs are born deleted. But it does happen and it seems silly to waste IO/CPU if we can help it.

bq. I think we should just stick with the corner case and not complicate the code if possible?

The patch really does not complicate the code? It adds a check against the liveDocs in the Docs/AndPositionsEnum passed to the codec during flush. The only complexity was fixing a test that made the invalid assumption that deleted docs must be present in postings.

I guess what bothers me here is this apparent precedent that deleted docs are in fact required to be present everywhere in a segment. Yes, this is the case today, but I think it's an impl detail and should not be required, e.g. enforced by CheckIndex, or by tests asserting that it's the case.

But I'll resolve this as WONTFIX ... looks like I'm just outvoted.
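For readers who want a picture of what "a check against the liveDocs" in the enum handed to the codec could look like, here is a rough sketch in plain Java. It does not use Lucene's DocsEnum/DocsAndPositionsEnum API; the class LiveDocsFilter, the int[] postings buffer, and the java.util.BitSet standing in for liveDocs are assumptions for illustration only. Note that surviving docIDs are passed through unchanged: nothing is renumbered.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

// Hypothetical wrapper: iterates one term's buffered docIDs, skipping deleted docs.
final class LiveDocsFilter implements PrimitiveIterator.OfInt {
  private final int[] docIDs;    // postings for one term, ascending docIDs
  private final BitSet liveDocs; // bit set = live; null = segment has no deletions
  private int pos = 0;

  LiveDocsFilter(int[] docIDs, BitSet liveDocs) {
    this.docIDs = docIDs;
    this.liveDocs = liveDocs;
    skipDeleted();
  }

  // Moves pos forward to the next live doc, or to the end of the postings.
  private void skipDeleted() {
    while (pos < docIDs.length && liveDocs != null && !liveDocs.get(docIDs[pos])) {
      pos++;
    }
  }

  @Override
  public boolean hasNext() {
    return pos < docIDs.length;
  }

  @Override
  public int nextInt() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    int doc = docIDs[pos++];
    skipDeleted();
    return doc;
  }

  // Convenience helper: collects what a consumer would actually see.
  static List<Integer> liveOnly(int[] docIDs, BitSet liveDocs) {
    List<Integer> out = new ArrayList<>();
    LiveDocsFilter it = new LiveDocsFilter(docIDs, liveDocs);
    while (it.hasNext()) {
      out.add(it.nextInt());
    }
    return out;
  }

  public static void main(String[] args) {
    BitSet liveDocs = new BitSet(5);
    liveDocs.set(0, 5);
    liveDocs.clear(1);
    liveDocs.clear(3);
    System.out.println(liveOnly(new int[] {0, 1, 3, 4}, liveDocs));
  }
}
```

A term whose documents are all deleted yields an empty iteration in this sketch, which is one of the corner cases a consumer would need to tolerate.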
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005837#comment-14005837 ]

Shai Erera commented on LUCENE-5693:

bq. But I'll resolve this as WONTFIX ... looks like I'm just outvoted.

I think you jumped to that conclusion too soon. The way I read it, there's one committer who +0.5, one who was OK w/ restricting this to postings only, and one who said that if it complicates the code, please don't do that -- but it doesn't. I don't think that's called outvoted :). But it's your call...
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005862#comment-14005862 ]

Robert Muir commented on LUCENE-5693:

{quote}
I guess what bothers me here is this apparent precedent that deleted docs are in fact required to be present everywhere in a segment. Yes, this is the case today, but I think it's an impl detail and should not be required, e.g. enforced by CheckIndex, tests asserting that it's the case.
{quote}

That's not the case. I am worried about *bugs, complexities, and slowdowns in lucene itself*. I already mentioned my list of concerns and I think they are all realistic.

To me, the patch is a bit naive. Perhaps you forgot (or didn't think about) what Sorted/SortedSetDocValuesWriter would have to do, if it wanted to filter out deleted documents? This would slow down flushing a lot, which presumably is important to people who are deleting documents in IndexWriter's ramBuffer. Filtering out deleted documents here would only *hurt* the user. Better to leave this to merge.

And what about stored fields and term vectors? Why wouldn't you put a TODO there in your patch? Is it because it's OK to have the API and system inconsistency there, because it would be slower to buffer them in RAM? I don't like these implicit exceptions to the rule. If we want to intentionally make a mess, there needs to be hard justification why we are doing such a thing.

All these little unproven optimizations, API inconsistencies, exceptional cases, they all add up. I think it would be better to only complicate things when it's a big win, otherwise the whole codebase will end up looking like IndexWriter.java.

All that being said, as I already stated on this issue, I am fine with filtering out the postings as an exception to the rule. I really don't like it one bit, but I can compromise for this piece if it really brings big benefits. Doing it for the rest of the codec API makes no sense at all.
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006306#comment-14006306 ]

ASF subversion and git services commented on LUCENE-5693:

Commit 1596938 from [~mikemccand] in branch 'dev/branches/lucene5675'
[ https://svn.apache.org/r1596938 ]

LUCENE-5675, LUCENE-5693: improve javadocs, disallow term vectors, fix precommit issues, remove trivial diffs, add new test case
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004821#comment-14004821 ]

Robert Muir commented on LUCENE-5693:

This only makes sense for postings though. How can we avoid writing deleted documents in:
* stored fields and term vectors (which we aren't flushing)
* docvalues (we would need to remap ordinals)

By writing them in some places and not writing them in other places, we open the possibility of extremely confusing corner cases and bugs.
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004835#comment-14004835 ]

Shai Erera commented on LUCENE-5693:

Today we apply the deletes (update the bitset) when a Reader is being requested. At that point, we have a SegmentReader at hand and we can resolve the delete-by-Term/Query to the actual doc IDs ... how would we do that while the segment is flushed? How do we know which documents were associated with {{Term t}}, while it was sent as a delete?

When I worked on LUCENE-5189 (NumericDocValues update), I had the same thought -- why flush the original numeric value when the document has already been updated? But I had the same issue - which documents were affected by the update Term.
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004902#comment-14004902 ]

Michael McCandless commented on LUCENE-5693:

bq. how would we do that while the segment is flushed?

We do it in FreqProxTermsWriter.applyDeletes; since we know the terms to be deleted, and we have the BytesRefHash, it's easy.
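As a rough illustration of why resolving delete-by-Term at flush is easy when the buffered postings are at hand, consider the sketch below. It is not the code of FreqProxTermsWriter.applyDeletes: the class BufferedDeletesSketch, the string-keyed maps (standing in for the BytesRefHash lookup), and the per-term "docIDUpto" cutoff, which here means the delete only affects docs buffered before the delete arrived, are all assumptions for the sake of the example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolve buffered delete-by-term against in-RAM postings.
final class BufferedDeletesSketch {

  // postings:    term -> ascending docIDs buffered for that term
  // deleteTerms: term -> docIDUpto (assumed: only docs with docID < docIDUpto are affected)
  // Returns liveDocs with a set bit for every document that is still live.
  static BitSet applyDeletes(
      Map<String, int[]> postings, Map<String, Integer> deleteTerms, int maxDoc) {
    BitSet liveDocs = new BitSet(maxDoc);
    liveDocs.set(0, maxDoc); // every doc starts out live
    for (Map.Entry<String, Integer> delete : deleteTerms.entrySet()) {
      int[] docs = postings.get(delete.getKey()); // hash lookup by term
      if (docs == null) {
        continue; // term never indexed in this segment
      }
      int docIDUpto = delete.getValue();
      for (int doc : docs) {
        if (doc >= docIDUpto) {
          break; // docs added after the delete are unaffected
        }
        liveDocs.clear(doc); // this doc is "born deleted"
      }
    }
    return liveDocs;
  }

  public static void main(String[] args) {
    Map<String, int[]> postings = new HashMap<>();
    postings.put("id:1", new int[] {0, 2}); // doc 2 re-added id:1 after the delete
    postings.put("id:2", new int[] {1});
    Map<String, Integer> deleteTerms = new HashMap<>();
    deleteTerms.put("id:1", 2);
    System.out.println(applyDeletes(postings, deleteTerms, 3));
  }
}
```

The point of the sketch is only that no reader is needed: the term dictionary and postings are already in memory, so the set of affected docIDs falls out of a lookup plus a short scan.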
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004906#comment-14004906 ]

Michael McCandless commented on LUCENE-5693:

bq. This only makes sense for postings though.

Right, postings is much easier than doc values. But postings are also the most costly to merge.

bq. By writing them some places and not writing them other places, we open the possibility of extremely confusing corner cases and bugs.

I disagree: I think we discover places that are relying on deleted docs behavior, i.e. test bugs. When I did this on LUCENE-5675 there were only a few places that relied on deleted docs.
[jira] [Commented] (LUCENE-5693) don't write deleted documents on flush
[ https://issues.apache.org/jira/browse/LUCENE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005450#comment-14005450 ]

Robert Muir commented on LUCENE-5693:

{quote}
I disagree: I think we discover places that are relying on deleted docs behavior, i.e. test bugs. When I did this on LUCENE-5675 there were only a few places that relied on deleted docs.
{quote}

That's not the complexity I'm concerned about. I'm talking about bugs in lucene itself because shit like the following happens:
* various codec apis unable to cope with writing 0 doc segments because all the docs were deleted
* various codec apis with corner case bugs because stuff like 'maxdoc' in the segmentinfo they are fed is inconsistent with what they saw.
* various index/search apis unable to cope with docid X appearing in codec api Y but not in codec api Z where it's expected to exist.
* slow O(n) passes thru indexwriter apis to recalculate and reshuffle ordinals and stuff like that.
* corner case bugs like incorrect statistics.
* additional complexity inside indexwriter/codecs to handle this, when just merging away would be better.

So if we want to rename the issue to, as a special case, "don't write deleted postings on flush" and remove the TODO about changing this for things like DV, then I'm fine.

But otherwise, if this is intended to be a precedent of how things should work, then I strongly feel we should not do this. The additional complexity and corner cases are simply not worth it.