[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-10 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415339#comment-15415339
 ] 

Yonik Seeley commented on LUCENE-7344:
--

bq. Here, "doc-1" should not be deleted, because the DBQ is submitted before 
the DV update, but because we resolve all DV updates before DBQ (in this 
patch), it ends up deleted.

For Solr, this can never happen...  All updates are blocked during A DBQ, and 
then we re-open the reader before releasing the lock.
All we need at the Solr level is to resolve DV updates *before* resolving 
deletes (since we know that the DQB will always be last).

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-10 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414906#comment-15414906
 ] 

Robert Muir commented on LUCENE-7344:
-

The performance of this seems really trappy, I am not sure we should do this?

Maybe we should just document the limitation unless there is a cleaner way.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-10 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414871#comment-15414871
 ] 

Shai Erera commented on LUCENE-7344:


bq. I don't understand most of what you're saying

To clarify the problem, both for you but also in the interest of writing a 
detailed plan of the proposed solution: currently when a DBQ is processed, it 
uses the LeafReader *without* the NDV updates, and therefore has no knowledge 
of the updated values. This is relatively easily solved in the patch I 
uploaded, by applying the DV updates before the DBQ is processed. That way, the 
DBQ uses a LeafReader which is already aware of the updates and all works well.

However, there is an order of update operations that occur in IndexWriter. In 
our case it could be a mix in of DBQ and NDV updates. So if we apply *all* the 
DV updates before any of the DBQs, we'll get incorrect results where the DBQ 
either delete a document it shouldn't (see code example above, and also what 
your {{testDeleteFollowedByUpdateOfDeletedValue}} shows), or not delete a 
document that it should.

To properly solve this problem, we need to apply the DV updates and DBQs in the 
order they were received (as opposed to applying them in bulk in current code). 
Meaning if the order of operations is NDVU1, NDVU2, DBQ1, NDVU3, DBQ2, DBQ3, 
NDVU4, then we need to:
# Apply NDVU1 + NDVU2; this will cause a new LeafReader to be created
# Apply DBQ1; using the already updated LeafReader
# Apply NDVU3; another LeafReader will be created, now reflecting all 3 NDV 
updates
# Apply DBQ2 and DBQ3; using the updated LeafReader from above
# Apply NDVU4; this will cause another LeafReader to be created

The adversarial affect in this case is that we cause 3 LeafReader reopens, each 
time (due to how NDV updates are currently implemented) writing the full DV 
field to a new stack. If you have many documents, it's going to be very 
expensive. Also, if you have a bigger sequence of interleaving updates and 
deletes, this gets worse and worse.

And so here comes the optimization that Mike and I discussed above. Since the 
NDV updates are held in-memory until they're applied, we can avoid flushing 
them to disk and creating a LeafReader which reads the original DV field + the 
in-memory DV updates. Note though: not *all* DV updates, but only the ones that 
are relevant up until this point. So in the case above, that LeafReader will 
view only NDVU1 and NDVU2, and later it will be updated to view NDVU3 as well.

This is purely an optimization step and has nothing to do with correctness (of 
course, that optimization is tricky and needs to be implemented correctly!). 
Therefore my plan of attack in this case is:

# Have enough tests that try different cases before any of this is implemented. 
For example, Mike proposed above to have the LeafReader + DV field "view" use 
docIdUpto. I need to check the code again, but I want to make sure that if 
NDVU2, NDVU3 and NDVU4 (with the interleaving DBQs) all affect the *same* 
document, everything still works.
# Implement the less-efficient approach, i.e. flush the DV updates to disk 
before each DBQ is processed. This ensures that we have a proper solution 
implemented, and we leave the optimization to a later step (either literally a 
later commit, or just a different patch or whatever). I think this is 
complicated enough to start with.
# Improve the solution to avoid flushing DV updates between the DBQs, as 
proposed above.

bq. testBiasedMixOfRandomUpdates

I briefly reviewed the test, but not thoroughly (I intend to). However, notice 
that committing (hard/soft ; commit/NRT) completely avoids the problem because 
a commit/NRT already means flushing DV updates. So if that's what this test 
does, I don't think it's going to expose the problem. Perhaps with the 
explanation I wrote above, you can revisit the test and make it fail though.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch, LUCENE-7344.patch, LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-09 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413364#comment-15413364
 ] 

Shai Erera commented on LUCENE-7344:


After chatting with Mike about this, here's an example for an "interleaving" 
case that Mike mentioned, where this patch does not work:

{code}
writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 17L);
writer.deleteDocuments(DocValuesRangeQuery.newLongRange("val", 5L, 10L, 
true, true));
writer.updateNumericDocValue(new Term("id", "doc-1"), "val", 7L);
{code}

Here, "doc-1" should not be deleted, because the DBQ is submitted before the DV 
update, but because we resolve all DV updates before DBQ (in this patch), it 
ends up deleted. This is wrong of course. I'm looking into Mike's other idea of 
having a LeafReader view with the DV updates up until that document, and then 
ensuring DV updates / DBQs are applied in the order they were submitted. This 
starts to get very complicated.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch, LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413235#comment-15413235
 ] 

Michael McCandless commented on LUCENE-7344:


Maybe a better option is to remove the doc values based queries.  One really 
shouldn't be querying based on doc values, i.e. the values should be indexed as 
points if range querying is going to be used.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413234#comment-15413234
 ] 

Michael McCandless commented on LUCENE-7344:


Maybe another option is to somehow make a wrapped {{LeafReader}} that exposes 
the point-in-time view for its delegate, but will also expose the live pending 
updated DVs when doc values are requested?  This would need to expose something 
that says how far (by docID) into those updates should this reader expose, per 
DBQ that's applied.   This should avoid the adversarial cases I think, or at 
least greatly minimize their impact.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-08-09 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413216#comment-15413216
 ] 

Shai Erera commented on LUCENE-7344:


Hmm ... I had to refresh my memory of the DV updates code and I agree with 
[~mikemccand] that the fix is hairy (which goes hand-in-hand with the hairy 
{{BufferedUpdatesStream}}). The problem is that deleteByQuery uses the existing 
LeafReader, but the DV updates themselves were not yet applied so the reader is 
unaware of the change.

I changed the test to call {{updateDocument}} instead of updating the NDV and 
the test passes. This is expected because updating a document deletes the old 
one and adds a new document. So when DBQ is processed, a LeafReader is opened 
on the new segment (with the new document; it has to work that way cause the 
new document isn't yet flushed) and the new segment thus has the new document 
with the updated NDV.

I agree this is a bug *only* because updating a document followed by DBQ works 
as expected. The internals of how in-place updates are applied should not 
concern the user.

I wonder if we need to implement a complex merge-sorting approach as 
[~mikemccand] proposes, or if we applied the DV updates before processing and 
DBQ would be enough (ignoring the adversarial affects that Mike describes; 
they're true, but I ignore them for the moment). I want to try that.

If that works, then perhaps we can detect if a DBQ involves an NDV field (or 
BDV field for that matter) and refresh the reader only then, or refresh the 
reader whenever there are DBQ and any DV updates, even if they are unrelated. 
But first I want to try and make the test pass, before we decide on how to 
properly fix it.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7344) Deletion by query of uncommitted docs not working with DV updates

2016-06-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15338486#comment-15338486
 ] 

Michael McCandless commented on LUCENE-7344:


This is indeed a bug, but I can't yet see how we can fix it.

The hairy code that resolves deletes and DV updates 
({{BufferedUpdateStream.java}} has a hard assumption that segments are 
write-once as far as queries resolving to docIDs is concerned, and updatable 
DVs breaks that assumption.

To properly fix this, at a high level, the code would need to do something 
along the lines of merge-sorting all DV updates and delete-by-query in order of 
their arrival to {{IndexWriter}} ({{docIDUpto}} proxies this), and then apply 
these in order, but after any DV update and just before a delete-by-query, we'd 
need to open a new {{LeafReader}} on that segment so it would reflect the DV 
updates (maybe just calling {{ReadersAndUpdates.writeFieldUpdates}}, and then 
pulling the reopened reader, would suffice for this).

But even if we can fix that, the code would then have an inherent terrible 
adversarial worst case of alternating DV update / DBQ.

Rather than fix this bug, I would lean towards requiring/documenting that DBQ 
must not use queries that rely on updated doc values.

> Deletion by query of uncommitted docs not working with DV updates
> -
>
> Key: LUCENE-7344
> URL: https://issues.apache.org/jira/browse/LUCENE-7344
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
> Attachments: LUCENE-7344.patch
>
>
> When DVs are updated, delete by query doesn't work with the updated DV value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org