[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates
[ https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415339#comment-15415339 ] Yonik Seeley commented on LUCENE-7344: -- bq. Here, "doc-1" should not be deleted, because the DBQ is submitted before the DV update, but because we resolve all DV updates before DBQ (in this patch), it ends up deleted. For Solr, this can never happen... All updates are blocked during A DBQ, and then we re-open the reader before releasing the lock. All we need at the Solr level is to resolve DV updates *before* resolving deletes (since we know that the DQB will always be last). > Deletion by query of uncommitted docs not working with DV updates > - > > Key: LUCENE-7344 > URL: https://issues.apache.org/jira/browse/LUCENE-7344 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ishan Chattopadhyaya > Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch > > > When DVs are updated, delete by query doesn't work with the updated DV value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates
[ https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414906#comment-15414906 ] Robert Muir commented on LUCENE-7344: - The performance of this seems really trappy, I am not sure we should do this? Maybe we should just document the limitation unless there is a cleaner way. > Deletion by query of uncommitted docs not working with DV updates > - > > Key: LUCENE-7344 > URL: https://issues.apache.org/jira/browse/LUCENE-7344 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ishan Chattopadhyaya > Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch > > > When DVs are updated, delete by query doesn't work with the updated DV value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates
[ https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414871#comment-15414871 ] Shai Erera commented on LUCENE-7344: bq. I don't understand most of what you're saying To clarify the problem, both for you but also in the interest of writing a detailed plan of the proposed solution: currently when a DBQ is processed, it uses the LeafReader *without* the NDV updates, and therefore has no knowledge of the updated values. This is relatively easily solved in the patch I uploaded, by applying the DV updates before the DBQ is processed. That way, the DBQ uses a LeafReader which is already aware of the updates and all works well. However, there is an order of update operations that occur in IndexWriter. In our case it could be a mix in of DBQ and NDV updates. So if we apply *all* the DV updates before any of the DBQs, we'll get incorrect results where the DBQ either delete a document it shouldn't (see code example above, and also what your {{testDeleteFollowedByUpdateOfDeletedValue}} shows), or not delete a document that it should. To properly solve this problem, we need to apply the DV updates and DBQs in the order they were received (as opposed to applying them in bulk in current code). Meaning if the order of operations is NDVU1, NDVU2, DBQ1, NDVU3, DBQ2, DBQ3, NDVU4, then we need to: # Apply NDVU1 + NDVU2; this will cause a new LeafReader to be created # Apply DBQ1; using the already updated LeafReader # Apply NDVU3; another LeafReader will be created, now reflecting all 3 NDV updates # Apply DBQ2 and DBQ3; using the updated LeafReader from above # Apply NDVU4; this will cause another LeafReader to be created The adversarial affect in this case is that we cause 3 LeafReader reopens, each time (due to how NDV updates are currently implemented) writing the full DV field to a new stack. If you have many documents, it's going to be very expensive. Also, if you have a bigger sequence of interleaving updates and deletes, this gets worse and worse. And so here comes the optimization that Mike and I discussed above. Since the NDV updates are held in-memory until they're applied, we can avoid flushing them to disk and creating a LeafReader which reads the original DV field + the in-memory DV updates. Note though: not *all* DV updates, but only the ones that are relevant up until this point. So in the case above, that LeafReader will view only NDVU1 and NDVU2, and later it will be updated to view NDVU3 as well. This is purely an optimization step and has nothing to do with correctness (of course, that optimization is tricky and needs to be implemented correctly!). Therefore my plan of attack in this case is: # Have enough tests that try different cases before any of this is implemented. For example, Mike proposed above to have the LeafReader + DV field "view" use docIdUpto. I need to check the code again, but I want to make sure that if NDVU2, NDVU3 and NDVU4 (with the interleaving DBQs) all affect the *same* document, everything still works. # Implement the less-efficient approach, i.e. flush the DV updates to disk before each DBQ is processed. This ensures that we have a proper solution implemented, and we leave the optimization to a later step (either literally a later commit, or just a different patch or whatever). I think this is complicated enough to start with. # Improve the solution to avoid flushing DV updates between the DBQs, as proposed above. bq. testBiasedMixOfRandomUpdates I briefly reviewed the test, but not thoroughly (I intend to). However, notice that committing (hard/soft ; commit/NRT) completely avoids the problem because a commit/NRT already means flushing DV updates. So if that's what this test does, I don't think it's going to expose the problem. Perhaps with the explanation I wrote above, you can revisit the test and make it fail though. > Deletion by query of uncommitted docs not working with DV updates > - > > Key: LUCENE-7344 > URL: https://issues.apache.org/jira/browse/LUCENE-7344 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ishan Chattopadhyaya > Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch > > > When DVs are updated, delete by query doesn't work with the updated DV value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates
[ https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413364#comment-15413364 ] Shai Erera commented on LUCENE-7344: After chatting with Mike about this, here's an example for an "interleaving" case that Mike mentioned, where this patch does not work: {code} writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 17L); writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 5L, 10L, true, true)); writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 7L); {code} Here, "doc-1" should not be deleted, because the DBQ is submitted before the DV update, but because we resolve all DV updates before DBQ (in this patch), it ends up deleted. This is wrong of course. I'm looking into Mike's other idea of having a LeafReader view with the DV updates up until that document, and then ensuring DV updates / DBQs are applied in the order they were submitted. This starts to get very complicated. > Deletion by query of uncommitted docs not working with DV updates > - > > Key: LUCENE-7344 > URL: https://issues.apache.org/jira/browse/LUCENE-7344 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ishan Chattopadhyaya > Attachments: LUCENE-7344.patch, LUCENE-7344.patch > > > When DVs are updated, delete by query doesn't work with the updated DV value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates
[ https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413235#comment-15413235 ] Michael McCandless commented on LUCENE-7344: Maybe a better option is to remove the doc values based queries. One really shouldn't be querying based on doc values, i.e. the values should be indexed as points if range querying is going to be used. > Deletion by query of uncommitted docs not working with DV updates > - > > Key: LUCENE-7344 > URL: https://issues.apache.org/jira/browse/LUCENE-7344 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ishan Chattopadhyaya > Attachments: LUCENE-7344.patch > > > When DVs are updated, delete by query doesn't work with the updated DV value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates
[ https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413234#comment-15413234 ] Michael McCandless commented on LUCENE-7344: Maybe another option is to somehow make a wrapped {{LeafReader}} that exposes the point-in-time view for its delegate, but will also expose the live pending updated DVs when doc values are requested? This would need to expose something that says how far (by docID) into those updates should this reader expose, per DBQ that's applied. This should avoid the adversarial cases I think, or at least greatly minimize their impact. > Deletion by query of uncommitted docs not working with DV updates > - > > Key: LUCENE-7344 > URL: https://issues.apache.org/jira/browse/LUCENE-7344 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ishan Chattopadhyaya > Attachments: LUCENE-7344.patch > > > When DVs are updated, delete by query doesn't work with the updated DV value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates
[ https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413216#comment-15413216 ] Shai Erera commented on LUCENE-7344: Hmm ... I had to refresh my memory of the DV updates code and I agree with [~mikemccand] that the fix is hairy (which goes hand-in-hand with the hairy {{BufferedUpdatesStream}}). The problem is that deleteByQuery uses the existing LeafReader, but the DV updates themselves were not yet applied so the reader is unaware of the change. I changed the test to call {{updateDocument}} instead of updating the NDV and the test passes. This is expected because updating a document deletes the old one and adds a new document. So when DBQ is processed, a LeafReader is opened on the new segment (with the new document; it has to work that way cause the new document isn't yet flushed) and the new segment thus has the new document with the updated NDV. I agree this is a bug *only* because updating a document followed by DBQ works as expected. The internals of how in-place updates are applied should not concern the user. I wonder if we need to implement a complex merge-sorting approach as [~mikemccand] proposes, or if we applied the DV updates before processing and DBQ would be enough (ignoring the adversarial affects that Mike describes; they're true, but I ignore them for the moment). I want to try that. If that works, then perhaps we can detect if a DBQ involves an NDV field (or BDV field for that matter) and refresh the reader only then, or refresh the reader whenever there are DBQ and any DV updates, even if they are unrelated. But first I want to try and make the test pass, before we decide on how to properly fix it. > Deletion by query of uncommitted docs not working with DV updates > - > > Key: LUCENE-7344 > URL: https://issues.apache.org/jira/browse/LUCENE-7344 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ishan Chattopadhyaya > Attachments: LUCENE-7344.patch > > > When DVs are updated, delete by query doesn't work with the updated DV value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates
[ https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15338486#comment-15338486 ] Michael McCandless commented on LUCENE-7344: This is indeed a bug, but I can't yet see how we can fix it. The hairy code that resolves deletes and DV updates ({{BufferedUpdateStream.java}} has a hard assumption that segments are write-once as far as queries resolving to docIDs is concerned, and updatable DVs breaks that assumption. To properly fix this, at a high level, the code would need to do something along the lines of merge-sorting all DV updates and delete-by-query in order of their arrival to {{IndexWriter}} ({{docIDUpto}} proxies this), and then apply these in order, but after any DV update and just before a delete-by-query, we'd need to open a new {{LeafReader}} on that segment so it would reflect the DV updates (maybe just calling {{ReadersAndUpdates.writeFieldUpdates}}, and then pulling the reopened reader, would suffice for this). But even if we can fix that, the code would then have an inherent terrible adversarial worst case of alternating DV update / DBQ. Rather than fix this bug, I would lean towards requiring/documenting that DBQ must not use queries that rely on updated doc values. > Deletion by query of uncommitted docs not working with DV updates > - > > Key: LUCENE-7344 > URL: https://issues.apache.org/jira/browse/LUCENE-7344 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ishan Chattopadhyaya > Attachments: LUCENE-7344.patch > > > When DVs are updated, delete by query doesn't work with the updated DV value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org